ABSTRACT


Exploratory data analysis problems have recently grown in importance due to 
the large magnitudes of data being collected by everything from satellites to supermarket 
scanners. This so-called “data glut” often precludes the effective processing of 
information for decision-making. These problems can be seen as search problems over 
massive unstructured spaces. A prototypical problem of this type involves the search, by 
Department of Defense medical agencies, for a so-called “Desert Storm Syndrome” 
which involves large amounts of medical data obtained over several years following the 
Persian Gulf conflict. This data ranges over more than 170 attributes, making the search 
problem over the attribute space a hard one. We propose the use of genetic algorithms for 
the attribute search problem, and intertwine it with search algorithms at the detailed data 
level. Computational results so far strongly suggest that our system has succeeded at the 
given tasks, requiring relatively few resources. The results also give no indication that a
single syndrome or other medical entity is responsible for wide-spread adverse health 
ramifications among a significant cross-section of Persian Gulf War participants in the 
CCEP program. There are, however, numerous correlations of exposure/demographic 
information and associated symptoms/diagnoses which suggest that smaller groups may
share common health conditions based on shared exposure to common health risk factors.
I. INTRODUCTION


A. ANALYSIS OF LARGE DATABASES 


Twenty years ago, computers were relatively scarce and applied to limited, highly 
specialized applications. At that time, there were rarely enough computerized data to make them 
an integral part of any organization’s decision-making process. As technology approached the
present day, automated information systems became more capable and more involved in daily 
life. They began capturing more and more data, allowing the computer to become an active 
participant in expanding facets of daily decision-making. The exponentially increasing volume of 
available data has transformed the decision challenge from one of “data starvation” to “data 
saturation.” Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Fayyad et al., 1996, pp. xv-xvi)
attribute this “mountain of stored data” to such factors as advances in scientific data
collection, introduction of bar codes, and the computerization of many business and government
transactions. In many situations today, there is so much data that human beings are unable to
correlate it all, and decision quality is again hampered, or in the words of John Naisbitt (Fayyad
et al., 1996, p. xv), “We are drowning in information, but starving for knowledge.”

Clearly there is a growing need for “intelligent agents,” or automated information 
systems that can sift through these mountains of data (which other systems have efficiently 
collected) and integrate these sources into concise, usable knowledge for use in human
decision-making. It is doubtful that a computer can reproduce the innovative creativity of a human
analyst, but a computer system can be given a basic representation of some of what the
human analyst desires. This representation of interest is then used to filter vast volumes of
available data (a task too time-consuming for humans) and present the human analyst with a
more concise body of knowledge in an understandable form. This premise is widely supported in
the literature, for example by Fayyad et al.:


Such volumes of data clearly overwhelm the traditional manual methods of data
analysis such as spreadsheets and ad-hoc queries. Those methods can create
informative reports from data, but cannot analyze the contents of those reports
to focus on important knowledge. A significant need exists for a new generation
of techniques and tools with the ability to intelligently and automatically assist
humans in analyzing the mountains of data for nuggets of useful knowledge.
These techniques and tools are the subject of the emerging field of knowledge
discovery in databases (KDD). (Fayyad et al., 1996, p. 2)

The Comprehensive Clinical Evaluation Program (CCEP) database presents this type of 
challenge to data analysis. The CCEP database contains vast amounts of information on over 
19,000 Persian Gulf War (PGW) veterans who have brought some form of health concern to the 
attention of the Department of Defense (DoD) military healthcare system. The database contains 
a large number of attributes, and there are still no defined parameters for search. In any case, 
because of problem structure and sheer size, the entire database cannot be comprehensively 
analyzed by conventional means. The goal of this thesis is to design, construct, and implement 
an artificially intelligent computer system which can analyze the CCEP database more efficiently 
than a conventional or “brute force” approach without unduly taxing scarce medical research 
assets. Such computer systems are said to carry out “data mining.” 


B. PURPOSE OF THIS RESEARCH 


The ultimate purpose of this research is to provide the CCEP program with a viable
methodology to obtain useful information from its database of participating PGW veterans. 
Determining what constitutes “useful” or “interesting” information is at least as great a challenge 
as devising an analysis tool. However, in the initial stages of medical research, interesting 
information is any statistical association between database attributes of different categorical 
groups. These associations may signal the existence of an undiscovered common ailment or 
"syndrome" affecting participants in the Persian Gulf War. 

Time and other resources are also key factors in the overall CCEP research project. 
Simply investigating every possible combination of attributes may be theoretically feasible, but 
in actuality often necessitates an impractically large commitment of resources to the analysis 
task. Therefore, investigative speed and efficiency have become key factors in this research. The 
need for speed and efficiency demands that this research develop an intelligent search device
capable of sifting through vast amounts of raw data and identifying interesting trends or
correlations without the need for human intervention. Consequently, a genetic algorithm has
been selected. No commercial product suited our particular needs, so the purpose of this research
includes the development and application of a genetic algorithm suited to analysis of medical
data, specifically the CCEP database.

Finally, this research evaluated the success of the new genetic algorithm (DaMI, the NPS 
Data Miner) from several aspects: 

• DaMI performance adheres to classical genetic algorithm theory

• DaMI statistical computations are valid and reproducible

• DaMI efficiently and comprehensively analyzes the search space

• Outcome hypotheses are of significant value to medical experts and the program sponsor

As with problem structuring, validation of results has proven to be a major research challenge
and is addressed in this thesis.

Computational results so far strongly suggest that our system has succeeded at the given 
tasks, requiring relatively few resources. The results also give no indication that a single
syndrome or other medical entity is responsible for wide-spread adverse health ramifications 
among a significant cross-section of Persian Gulf War participants in the CCEP program. There 
are, however, numerous correlations of exposure/demographic information and associated 
symptoms/diagnoses which suggest that smaller groups may share common health conditions 
based on shared exposure to common health risk factors. 


C. SCOPE OF RESEARCH 


This research examines the problem structuring challenges for analyzing the data 
contained in the CCEP database. It discusses the general qualities of genetic algorithms and the 
specific techniques used to apply a genetic algorithm to the study of the CCEP database. The 
research focuses on application of a genetic algorithm to a relevant real-world problem and does 
not contain an in-depth description of genetic algorithm theory. An original genetic algorithm 
(DaMI) was created by this research effort. A technical description of the DaMI algorithm, its
development process, and evaluation methodology are included. It is not the purpose of this
research to survey all possible solutions to the CCEP analysis challenge, but rather to completely
examine and document one apparently successful solution. Finally, the results of the DaMI
analysis of the CCEP database are presented along with the validation process and
recommendations for further research. The following research questions were addressed:


• If there is a common ailment or “syndrome” (and there may be more than one)
afflicting veterans of the Persian Gulf War, how will it manifest itself within the
scope of information gathered by the CCEP database?

• How will the subjective concept of interesting information (to the medical
community) be quantitatively measured and used to compare the “fitness” of
different hypotheses?

• How should the research problem and database be structured to facilitate automated
analysis?

• Why is a genetic algorithm a more effective means of analyzing the CCEP search
space than other more conventional methods?

• How was DaMI constructed? What were the design considerations and key
innovations in this particular genetic algorithm?

• What analyses were conducted and what were the results?

• Were the results validated and were they useful to the project sponsor (CCEP,
Deployment Surveillance Team) and CCEP medical researchers?


D. REAL WORLD APPLICABILITY


A great deal of research has been performed on genetic algorithms and related artificial
intelligence-based research tools. In many cases the data analyzed were real, but in few cases was
the research tied to a real-world, time-sensitive research problem. One of the primary reasons
for using a genetic algorithm is that an answer is needed, but conventional research resources are
not available to produce that answer within the allotted time. This makes a study of a real-world
genetic algorithm development all the more interesting. The CCEP database research is highly
visible, relevant, and time-sensitive.


Few medical issues have received as much attention as the proverbial
“Desert Storm Syndrome” in recent years. Since the first returning Persian Gulf War (PGW)
veterans began reporting health issues, this subject has received constant attention by the U.S. 
government, military medical researchers, and most prolifically the media. A Presidential 
commission has been appointed to determine what, if any, health ailments may be attributed to 
the service of U.S. armed forces in the Persian Gulf. Research efforts continue at many DoD and 
Veterans Administration (VA) facilities. It is certainly appropriate to say that the CCEP is “high 
visibility.” 

Similarly, the concept of relating diseases to groups of humans with similar symptoms 
and life experiences (demographics and exposure to physical objects) has been a focus of medical
research for many years. Some of the earliest genetic algorithm experiments attempted to relate 
symptoms to diagnoses. Medical science has consistently searched for better ways to answer the 
question, “What caused this disease?” In the case of CCEP, 697,000 veterans (not to mention 
their families) are eager to know if their service in the PGW increases their susceptibility to any 
type of medical malady. From an academic perspective, the issue of automatically identifying 
“interesting” information has become increasingly fascinating and challenging. Technology has 
increased researchers’ ability to automate aspects of a medical situation, but the problem of 
making a model that accurately reflects the information remains. 


E. THESIS METHODOLOGY AND ORGANIZATION 


This research begins with examination of the CCEP research challenge as a whole. The 
first challenge is to structure the CCEP research question of what is an “interesting” hypothesis 
into a mathematical formula (fitness function). This in turn returns a higher “fitness” to
hypotheses of greater interest to CCEP medical researchers. Our research tried many 
alternatives, but settled on the use of the Modified J-measure (described in section I.E.4.c) to 
assess relative independence between premise and outcome variables. The CCEP database was 
not designed with medical research in mind, so the second challenge was to reformat the database 
into a structure which supported automated analysis. 
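
As a point of reference, the standard (unmodified) J-measure for a rule of the form "if Y = y then X = x" can be sketched as follows. This is only an illustration under stated assumptions: it is not the Modified J-measure used by DaMI, whose definition is given elsewhere in the thesis, and the function and parameter names are hypothetical.

```python
import math


def j_measure(p_y, p_x, p_x_given_y):
    """Standard J-measure of the rule 'if Y = y then X = x'.

    p_y          -- probability that the premise holds, P(Y = y)
    p_x          -- prior probability of the outcome, P(X = x)
    p_x_given_y  -- probability of the outcome given the premise, P(X = x | Y = y)

    NOTE: this is the textbook form, not the thesis's Modified J-measure.
    """
    if not 0.0 < p_x < 1.0:
        raise ValueError("p_x must lie strictly between 0 and 1")

    def term(posterior, prior):
        # By convention 0 * log(0 / prior) is taken to be 0.
        return 0.0 if posterior == 0.0 else posterior * math.log2(posterior / prior)

    information = term(p_x_given_y, p_x) + term(1.0 - p_x_given_y, 1.0 - p_x)
    return p_y * information
```

The measure is zero when the premise tells us nothing about the outcome (the conditional probability equals the prior) and grows as the premise both applies often and shifts the outcome probability strongly.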


Once the problem and source database were structured appropriately, a suitable research 
tool was needed. It was clear that using a “brute force” approach to examine the CCEP database, 
even using computer simulation, was impractical because of the tremendous size of the search 
space. A genetic algorithm was chosen because of the innate ability of genetic algorithms to 
inductively adapt to the researcher’s goals and to intelligently analyze a search space, bypassing 
hypotheses which show little chance of future success. Our concept enhanced the conventional 
genetic algorithm approach by dividing the process into two modules: a genetic operator, which
handles selection and recombination of hypotheses at the field level only, and a statistical 
package, which analyzes every possible combination of hypothesis fields passed from the genetic 
operator and returns an integrated fitness measure for the entire hypothesis. Additionally, our 
tool examines multiple independent and dependent (LHS and RHS) fields because CCEP could 
not determine which field or combination of fields would identify a target outcome. 

Finally, the problem of validation and search space coverage must be addressed. A great 
deal of literature supports the idea that a genetic algorithm can deduce hypotheses that apply to a 
database. However, it is critical both that these results be validated against independent data and
that they be shown to accurately address the research question, instead of merely describing the
actual data set analyzed. Several tools were developed to validate the results, among them an
independent validation algorithm which re-tests result hypotheses against the
subject database and a cross-validation procedure that tests hypotheses generated from one
randomly sampled subset of the database against another randomly sampled subset.
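
The cross-validation idea can be illustrated with a minimal sketch. It assumes records are dictionaries of attribute values and uses a simple conditional frequency as a stand-in score; the actual DaMI procedure and fitness measure are described later in the thesis, and every name below is hypothetical.

```python
import random


def split_records(records, seed=0):
    """Shuffle a copy of the records and cut it into two halves."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]


def confidence(records, premise, outcome):
    """Fraction of records matching the premise that also match the outcome.

    Returns None when no record matches the premise.
    """
    matching = [r for r in records
                if all(r.get(k) == v for k, v in premise.items())]
    if not matching:
        return None
    hits = sum(1 for r in matching
               if all(r.get(k) == v for k, v in outcome.items()))
    return hits / len(matching)


def cross_validate(records, premise, outcome, seed=0):
    """Score one hypothesis on two disjoint random halves of the data."""
    first, second = split_records(records, seed)
    return confidence(first, premise, outcome), confidence(second, premise, outcome)
```

A hypothesis that scores well on the half it was generated from but poorly on the other half is more likely to reflect a quirk of the sample than a real association.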

The thesis is divided into seven chapters: 


• Chapter I: Introduction

• Chapter II: Description of the CCEP Research, the database itself, and problem
structure challenges

• Chapter III: Overall solution concept and high-level research approach

• Chapter IV: Description of the DaMI algorithm, its design, implementation, and
validation processes

• Chapter V: Technical description of the DaMI algorithm operators, innovations, and
procedures

• Chapter VI: Summary of results

• Chapter VII: Conclusion and recommendations for future research
II. COMPREHENSIVE CLINICAL EVALUATION PROGRAM


A. BACKGROUND AND HISTORY OF CCEP 


The Department of Defense (DoD) began to examine the health consequences of Persian 
Gulf War (PGW) service while U.S. troops were still deployed to the Persian Gulf Region. The 
initial focus of medical researchers was on the health risks associated with smoke from Kuwaiti 
oil fires. As early as 1992, groups of PGW veterans began presenting with health complaints 
which they attributed to PGW service. Many of these veterans reported nonspecific symptoms or 
those not directly attributable to a specific disease or syndrome (group of commonly occurring 
symptoms/conditions). This sparked the first of many tests (first by the Army in 1992 and 
subsequently by other services) to attempt to discover if these non-specific symptoms could be 
linked with any “clusters” of PGW veterans. The theory of this approach is that a new syndrome 
will present as a “cluster” or group of individuals sharing some common trait (demographics, 
location, action, exposures, etc.) who also share a similar group of symptoms. (CCEP, 1996, pp. 
6-7) This is the first step to identifying a new syndrome. Once a syndrome is defined, then 
medical researchers begin efforts to find the cause of the syndrome. If a solid cause-effect
relationship is established and documented between the syndrome and an entity (virus, bacteria,
etc.) or health risk factor(s) (like smoking or cholesterol), then the syndrome may be considered
a full-fledged disease.

In response to the health concerns of PGW Veterans, both DoD and Veterans Affairs 
(VA) established similar comprehensive clinical evaluation programs. The data for this research 
comes from the DoD CCEP. The CCEP program was officially enfranchised by the Assistant 
Secretary of Defense (Health Affairs) as part of a three-point plan, announced on 11 May 1994. 
This plan included: 


• The development of an aggressive, comprehensive, clinical diagnostic program to
offer intensive examinations to veterans who do not have clearly defined diagnoses,

• An initial independent review of DoD clinical and research efforts concerning the
Persian Gulf War by Dr. Harrison C. Spencer, Dean of the Tulane School of Public
Health and Tropical Medicine, New Orleans, Louisiana, and

• The creation of a forum of national medical and public health experts to review,
comment, and advise DoD concerning the results of the clinical evaluation program.
(Joseph, 1994)


CCEP continues to offer in-depth medical examinations, through the Military Health Services
System (MHSS), to any PGW veteran having health concerns. Over 27,000 PGW veterans and
their dependents have initiated medical examinations with CCEP, of which over 19,000 have 
been completed by the participants. The data collected from these 19,000 participants has been 
recorded in a single database (the CCEP database), which is the source database for this research. 
(CCEP, 1996, pp. 7 - 12) 

Since the inception of CCEP, numerous medical research programs have been conducted 
by DoD and non-DoD health organizations (including the Defense Science Board, National
Institutes of Health, Naval Health Research Center in San Diego, University of California,
Department of Health and Human Services, and National Academy of Sciences). Although 
several research efforts are still ongoing, the possibility of an unknown syndrome or disease 
affecting PGW veterans and their families has been exhaustively examined. DoD has committed 
to continue research on this issue but stated: 


To date, there is no clinical evidence for a previously unknown, serious illness 
or ‘syndrome’ among Persian Gulf veterans participating in the CCEP. A 
unique illness or syndrome among Persian Gulf veterans evaluated through the 
CCEP, capable of causing serious impairment in a high proportion of veterans 
at risk, would probably be detectable in the population of 18,598 patients. 
However, an unknown illness or a syndrome that was mild or affected only a 
small proportion of veterans at risk might not be detectable in a case series, no 
matter how large. (CCEP, 1996, p. 4) 


It is this viewpoint that has catalyzed the need for an intelligent, automated search program to 
analyze the CCEP database. Clearly, conventional research (user-controlled query and clinical 
evaluation) has reached the limit of available resources, and yet there is still a possibility that a
syndrome has remained undetected. Proper implementation of a genetic algorithm can expand
the horizon of research by sifting through hypotheses not yet considered, and can do so using
small amounts of time, funds, and human effort.


B. CCEP RESEARCH VISION


The core of CCEP research is based on classic epidemiological technique. The CCEP 
database has been constructed to capture as wide a range of data about PGW participants as is 
practical. Data collection practices have been standardized and unbiased--any participant with a 
concern undergoes the same health screening and examination process. The basic premise of 
analysis is that a new syndrome will present as "prominent and consistent physical and 
laboratory findings" like Legionnaires' disease or toxic shock syndrome, or as consistent
“non-specific symptomatology” as with chronic fatigue syndrome and fibromyalgia.

In any case, CCEP research efforts focus on slicing the database in many different 
directions, whether by demographic information, symptoms, diagnoses, or reported exposure 
categories. Percentages of PGW participants in each slice or "cluster" (which is a group of 
participants with the same characteristics within a given research slice) are compared to the
percentage expected within a similar population not participating in the PGW. In many cases
(especially when the database is sliced by reported exposures), no comparable group is available, 
so these percentages are compared against actual percentages or distributions among all 697,000 
PGW personnel (as opposed to just those participating in CCEP). The point of the analysis is to 
isolate any characteristic which appears to make a CCEP participant more likely to have 
approached CCEP with a medical condition. 
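
The comparison described above amounts to a ratio of two proportions. The following sketch shows that arithmetic only; the function name is hypothetical, and any counts passed to it would have to come from the actual CCEP and PGW population data rather than from this example.

```python
def relative_prevalence(cluster_size, group_size, reference_size, reference_total):
    """Ratio of the observed proportion in a slice to the expected proportion.

    cluster_size    -- CCEP participants in the slice with the characteristic
    group_size      -- all CCEP participants in the slice
    reference_size  -- members of the comparison population with the characteristic
    reference_total -- size of the comparison population (for example, all PGW personnel)

    A value well above 1 suggests the characteristic is over-represented
    among CCEP participants; it does not by itself establish significance.
    """
    if group_size <= 0 or reference_total <= 0 or reference_size <= 0:
        raise ValueError("population sizes and reference count must be positive")
    observed = cluster_size / group_size
    expected = reference_size / reference_total
    return observed / expected
```

For instance, if a characteristic appeared in 5 percent of a CCEP slice but in 2.5 percent of the comparison population, the function would return a ratio of about 2.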

If some specific combination of demographics, personal habits (smoking/non-smoking),
and reported exposure is associated with specific symptoms and diagnoses within the group of
CCEP participants, then medical research is developed to clinically test the relationship of these
factors to personal health. It should be apparent that this approach is extremely resource
intensive. Analysis dimensions are limited by the imagination of individual researchers
developing the slices and the physical ability of medical researchers to examine the hypothesis.
If the quality of "statistical interest" could be mathematically modeled by an automated research 
tool, then the dimensions of analysis could be expanded to the limits of computer (as opposed to 
human) resources. The genetic algorithm (DaMI) is a research tool designed specifically to
relieve humans from the drudgery of human-controlled analysis so that they may focus efforts on 
clinical testing which machines cannot do. 


C. DATABASE DESCRIPTION 


The CCEP database is a “flat file,” or single table, with 177 attributes. It was created in
standard dBase® format and was received and manipulated using the Visual FoxPro®
Database Management System (DBMS). The database was not designed with automated
analysis (or, for that matter, medical research) in mind. Therefore, a great deal of manual file
manipulation was required before automated analysis was possible. By “manual” we mean the
issuance of single SQL® commands to reformat individual database schema and field values. At
no time was the actual data adjusted, but in many cases the representation schema was changed
to enhance automated processing. Appendix A contains the CCEP data dictionary along with a
commentary on modifications/usability of each field, and a synopsis of the CCEP data collection 
process. The actual database used for research contains 17,033 records for active duty CCEP 
participants. Dependent records were removed prior to analysis at the request of the CCEP 
program manager. 

A large number of attributes containing administrative and/or privacy act data were 
removed from the database and other attributes were added to enhance the schema, as discussed 
above. (For a more complete description of schema modifications, see Section II.D.2.) In all, 140
attributes were present in the research database. Not all were examined at once (see Section
VI.A), but in any case the database was relatively large by medical or occupational health
research standards. The remaining attributes fall into four major categories: 


• Demographic. Physical attributes of each participant (e.g. race, gender, age, home
state, service component, Unit Identification Code [UIC])

• Reported Exposures. Reported exposures to potentially hazardous environmental
conditions by participants (e.g. botulism vaccine, oil smoke, uranium, passive
smoke, local water, SCUD attack)

• Reported Standard Symptoms. Standard symptoms elicited by physicians during
CCEP medical examinations (e.g. difficulty breathing, fatigue, headaches)

• Diagnoses. Each participant completing the entire CCEP medical examination
process was assigned a primary and up to six secondary diagnoses. Diagnoses
followed the standard numeric ICD coding system (e.g. V65.5 - Healthy Exam,
307.81 - Chronic Muscle Tension Headaches, 780.71 - Fatigue)


As will be seen in later sections, most analysis was conducted on associations between these
major attribute categories.
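
To give a rough sense of scale, the number of distinct attribute combinations can be counted directly. The sketch below is illustrative only: it counts unordered subsets of attributes up to a chosen size, ignoring both the specific values each attribute can take and the split into premise and outcome fields, so it understates the real hypothesis space. The function name is hypothetical.

```python
import math


def count_attribute_subsets(n_attributes, max_size):
    """Number of non-empty attribute subsets containing at most max_size attributes."""
    return sum(math.comb(n_attributes, k) for k in range(1, max_size + 1))
```

With the 140 attributes of the research database, `count_attribute_subsets(140, 2)` gives 9,870 single attributes and pairs, and `count_attribute_subsets(140, 3)` gives 457,450. Each additional attribute allowed in a hypothesis multiplies the count further, before attribute values are even considered.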


D. WHY DOES A GENETIC ALGORITHM WORK FOR CCEP 
ANALYSIS? 


1. Theory


The theory of genetic algorithms was invented by John Holland in the early 1970’s. 
Holland’s purpose was to create a search method based on the process of natural selection 
observed in nature. He likened the attributes making up a hypothesis in a search problem to 
chromosomes which “encode” a living being. He proposed that by creating mathematical 
representations of genetic reproduction and applying natural selection, scored by a fitness 
function, to those representations, he could create an adaptive search engine. Automation of this 
process has proven to be an excellent task for computer systems. Although a great deal of 
evolution is not understood, several general features are agreed upon (Davis, 1991, pp. 2-3):


• Evolution is a process that operates on chromosomes rather than on the living beings
they encode.

• Natural selection is the link between chromosomes and the performance of their
decoded structures. Processes of natural selection cause those chromosomes that
encode successful structures to reproduce more often than those that do not.

• The process of reproduction is the point at which evolution takes place. Mutations
may cause the chromosomes of biological children to be different from those of their
biological parents, and recombination processes may create quite different chromosomes
in the children by combining material from the chromosomes of two parents.

• Biological evolution has no memory. Whatever it knows about producing
individuals that will function well in their environment is contained in the gene
pool--the set of chromosomes carried by the current individuals--and in the structure
of the chromosome decoders.


If one is to follow the theory of natural selection, then it could be inferred that attributes used to 
make hypotheses are the operators of evolution. The process of hypothesis evolution revolves 
around the combination of those constituent attributes of successful hypotheses and their 
resulting recombinations. Furthermore, these recombinations are directed blindly and guided 
only by the principle that attributes belonging to hypotheses of higher fitness measure are 
recombined more frequently than attributes belonging to hypotheses possessing lower fitness 
measure. 

Holland went on to create three genetic operators which could mathematically recombine
the modeling chromosomes of coded hypotheses to mimic genetic recombination. Hypotheses
from the gene pool of the current generation are “selected” with a bias towards hypotheses with
higher fitness measures, and then operated on by one of these three genetic operators:


• Reproduction. Asexual reproduction of a single parent rule to a single offspring rule
without modification.

• Crossover. Sexual reproduction involving the exchange of chromosomes between
two parents, producing two different child rules.

• Mutation. Asexual reproduction of a single parent rule with random modifications,
resulting in a different child rule.
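
The three operators listed above can be sketched on rules represented as plain lists of attribute names. This is a generic textbook-style illustration, not the DaMI implementation: single-point crossover, equal-length parents, and all function and attribute names are assumptions made for the example.

```python
import random


def reproduce(parent):
    """Copy a single parent rule to a single offspring rule without modification."""
    return list(parent)


def crossover(parent_a, parent_b, rng):
    """Single-point crossover: swap the tails of two equal-length parent rules."""
    if len(parent_a) != len(parent_b) or len(parent_a) < 2:
        raise ValueError("parents must have equal length of at least 2")
    point = rng.randrange(1, len(parent_a))  # cut somewhere strictly inside the rule
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


def mutate(parent, attribute_pool, rng):
    """Replace one randomly chosen attribute with a different one from the pool."""
    child = list(parent)
    position = rng.randrange(len(child))
    alternatives = [a for a in attribute_pool if a != child[position]]
    if not alternatives:
        raise ValueError("attribute pool offers no alternative attribute")
    child[position] = rng.choice(alternatives)
    return child
```

Passing an explicit `random.Random` instance keeps runs repeatable for a given seed, which helps when results from several runs are compared later.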


Using the “two-armed and k-armed bandit problems” (see Holland, 1975, for the complete proof),
Holland went on to prove that, lacking prior knowledge of the expected value of two or more
choices, allocating slightly more than exponentially increasing trials to choices with the highest
past success is the optimal means for choosing between options. The results of this theory and
its relation to genetic operators are summed up well by Goldberg:


In other words, to allocate trials optimally (in a sense of minimal expected loss),
we should give slightly more than exponentially increasing trials to the observed
best arm...Another method that comes even closer to the ideal trial allocation is
the three-operator genetic algorithm discussed earlier. The schema theorem
guarantees giving at least an exponentially increasing number of trials to the
observed best building blocks. In this way the genetic algorithm is a realizable yet
near optimal procedure (Holland, 1973a, 1975) for searching among alternative
solutions. (Goldberg, 1989)

It is important to reiterate that genetic algorithms gain their speed not by analyzing an entire
search space, but by deciding which attributes (chromosomes) hold the least probability of
producing interesting hypotheses and not testing hypotheses using those attributes. The process
is not deterministic, for it relies on probabilistic selection, and different results may be derived
each time the algorithm is run. This fact will be discussed further in the discussion of results
validation.

We now bring this theory closer to the current research question. A hypothesis
concerning the CCEP database may be “encoded” into a string representing its constituent
attributes. If one is to hold with Holland’s theory, then the attributes (in this case demographic,
exposure, symptom, or diagnosis) which make up the hypotheses (in a group of hypotheses)
having the highest fitness measure should be recombined in an exponentially increasing number
of ways. Similarly, the attributes from unsuccessful hypotheses should be recombined
exponentially less often. Genetic operators, used in the DaMI genetic algorithm, prove to be a
near-optimal way of accomplishing this selection. Finally, if this process is followed, then the
extremely large search space of correlations within the CCEP database will be searched
efficiently using a genetic algorithm. It is on this theoretical basis that we chose a genetic
algorithm to analyze the CCEP database.
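
A minimal sketch of the two ideas in this paragraph, encoding a hypothesis as a string and selecting parents with a bias toward higher fitness, is given below. The string layout, the attribute names, and the use of fitness-proportional ("roulette wheel") selection are assumptions made for illustration; the encoding and selection scheme actually used by DaMI are described in later chapters.

```python
import random


def encode(premise_attributes, outcome_attributes):
    """Encode a hypothesis as 'premise attributes => outcome attributes'.

    Attributes are sorted so that the same hypothesis always yields the same string.
    """
    left = "|".join(sorted(premise_attributes))
    right = "|".join(sorted(outcome_attributes))
    return left + "=>" + right


def select_parents(population, fitnesses, n_parents, rng):
    """Draw parents with probability proportional to fitness (with replacement).

    Assumes fitness values are non-negative and not all zero.
    """
    return rng.choices(population, weights=fitnesses, k=n_parents)
```

Under this scheme a hypothesis with twice the fitness of another is expected to be drawn about twice as often, so the attributes of fitter hypotheses take part in more recombinations from one generation to the next.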
2. Advantages and Disadvantages of the Genetic Algorithm Method


There is a great deal of theoretical literature on the advantages and disadvantages of 
using genetic algorithms. It is the intent of this section to relate practical lessons learned from 
our specific research using DaMI on the CCEP database. From the point of view of this research, 
a genetic algorithm was particularly useful because of its ability to process tremendous amounts 
of data and its lack of need for human interaction. It has already been shown that the CCEP
problem search space is too large to analyze by conventional means, even with a computer. The
problem cannot be structured strongly enough to limit the possibilities to realistic numbers, so
technology is being relied upon to perform the discrimination. Medical research assets are a
scarce resource,
so employing medical experts only at the fitness function creation and final analysis stages 
produces efficient and effective results. Should preliminary implementation of genetic 
algorithms prove informative in this area of medical research, many other similar research 
questions may benefit from this technology. 

There are several disadvantages to using genetic algorithms, several of which have 
already been alluded to. First, as can be seen from section II.D, a great deal of effort must be 
committed to database structure and normalization before processing. Since the system relies on 
computer evaluation of data, the data structure and coding scheme must be uniform and 
conducive to information extraction. Non-descriptive representations and textual data collection 
will severely curtail system performance. The strong coding and standardization of the CCEP 
database was one of the aspects that made it so attractive for this type of research. Second, a 
genetic algorithm is useless without a single, unambiguous representation of what is interesting 
to the operator. This was a key challenge to this research. There are many measures which may 
indicate the “interestingness” of a particular hypothesis, but the synthesis of a single aggregate 
measure which satisfies all components of epidemiological interest has been extremely difficult 
(several different fitness functions may be required). Finally, a difficult paradox arises when 
attempting to prove that a genetic algorithm has completely searched a large space. A genetic 
algorithm achieves its speed advantage by selective analysis, meaning it selectively eliminates 
search options with, apparently, little chance of yielding interesting results. The only way to 
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actually prove that an interesting hypothesis was not missed is to physically test every 
hypothesis, but we turned to the genetic algorithm because the resources necessary to search the 
entire space were not available. To address this problem, the genetic algorithm is run several 
times. If the outcomes produced by several independent runs have a high intersection 
(particularly among hypotheses of high fitness), then there is strong evidence that the space has 
been searched adequately. A more detailed discussion of this challenge is included in Chapter V. 
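
One way to quantify the “high intersection” between independent runs is sketched below. The Jaccard ratio and the names `top_hypotheses` and `overlap` are our own assumptions for illustration; the text does not specify how the intersection was measured.

```python
def top_hypotheses(run, n):
    """Return the set of the n fittest hypotheses from one run.
    `run` maps a hypothesis (any hashable encoding) to its fitness."""
    ranked = sorted(run, key=run.get, reverse=True)
    return set(ranked[:n])

def overlap(runs, n):
    """Jaccard overlap of the top-n sets across runs: |intersection| / |union|."""
    tops = [top_hypotheses(r, n) for r in runs]
    common = set.intersection(*tops)
    union = set.union(*tops)
    return len(common) / len(union) if union else 0.0

# Made-up fitness values for two independent runs.
run1 = {"h1": 9.0, "h2": 7.5, "h3": 2.0, "h4": 1.0}
run2 = {"h2": 8.0, "h1": 7.0, "h5": 3.0, "h3": 0.5}
score_top2 = overlap([run1, run2], n=2)
score_top3 = overlap([run1, run2], n=3)
```

A score near 1 among the fittest hypotheses suggests the runs converged on the same region of the space; a low score suggests more runs are needed.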
To sum up, this research has found that genetic algorithms do search a very large space 
of alternatives very quickly and efficiently. Successive generations of hypotheses quickly 
improve in quality as measured by the fitness function, and therefore the algorithm does adjust its 
search to the operator's goals. Strong database standardization and coding are a must before any 
processing is attempted. A genetic algorithm has proven successful in this research, provided that a 
fitness function can be created which accurately defines “what is interesting” to the researchers. 


E. KEY CHALLENGES TO CCEP ANALYSIS BY A GENETIC ALGORITHM 


1. Problem Structure 


The single most challenging aspect of this research is that “Persian Gulf Syndrome” as it 
is referred to by the media, PGW veterans, and some researchers, is not yet really a defined 
syndrome at all. A syndrome must be defined by a unique series of symptoms and/or ailments 
which are shared by a specific group of individuals. Although many PGW veterans report a wide 
array of non-specific medical ailments associated with PGW service, no defined set of 
symptomatology has been instantiated as a candidate syndrome. 


CCEP clinicians have identified a wide range of specific diagnoses (i.e. 
migraine headache, depression, asthma, arthritis, hypertension). However, few 
if any of the conditions diagnosed to date could be considered specific for any of 
the many different exposures implicated as potential causes of Persian Gulf 
illnesses. Thus as a case series, the CCEP has identified a wide spectrum of 
different clinical conditions rather than any singular homogeneous diagnostic 
entity (CCEP, 1996, p. 79) 
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While the medical implications of this statement are serious, the impact of this situation on 
research is tremendous. Basically, CCEP medical researchers cannot provide us with a 
description of a target syndrome for research or, for that matter, say whether there are one, many, or any 
syndrome(s) at all. Without target syndrome characteristics, a researcher is unable to identify 
which field or combinations of fields within the database indicate a desired outcome (a syndrome 
of interest). In truth, researchers do not know if the data necessary to identify a syndrome, 
should one exist, is contained in the database at all. Therefore, we have been compelled to 
develop a tool which can examine “interesting” associations between any number of causative 
and outcome attributes without specificity as to the limits of either the causative or outcome 
space. This is both a curse and a blessing; the lack of specifics makes the problem considerably 
more challenging but also stimulates interest in our type of tool. 

What can be reasonably asked about the problem is the following: 


• Is there a syndrome? Is there a subset a (of A) of ailments such that the occurrence rate 
of a in PGW participants (G) is higher than the rate in a reference population (R)? 
[#a(G) equates to “number of occurrences of an ailment within the set of participants 
(G)”] 

$$\frac{\#a(G)}{\#(G)} > \frac{\#a(R)}{\#(R)}$$

• What caused the syndrome? Is there a subset x (of X) of exposures and/or 
demographics experienced by or attributed to participants in the PGW such that, for 
ailments a for which the prior inequality is true, exposures/demographics x account 
for a significant part of the difference in occurrence rates of a in groups G and R? 

$$\frac{\#a(G)}{\#x(G)} \geq \frac{\#a(R)}{\#x(R)}$$

$$P(a \mid x, G) \approx P(a \mid x, R)$$
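
The two questions can be made concrete with a small sketch on made-up records. The record layout, the ailment “fatigue”, the exposure “smoke”, and the helper names are all hypothetical; the example only shows how an occurrence rate and a rate conditioned on an exposure would be compared between groups G and R.

```python
def occurrence_rate(group, ailment):
    """#a(S) / #(S): fraction of participants in the group reporting the ailment."""
    return sum(ailment in p["ailments"] for p in group) / len(group)

def conditional_rate(group, ailment, exposure):
    """P(a | x, S): occurrence rate restricted to participants reporting exposure x."""
    exposed = [p for p in group if exposure in p["exposures"]]
    return occurrence_rate(exposed, ailment) if exposed else None

def person(ailments=(), exposures=()):
    """Build one toy participant record."""
    return {"ailments": set(ailments), "exposures": set(exposures)}

# Toy groups: G (PGW participants) and R (reference population).
G = [person(["fatigue"], ["smoke"]), person(["fatigue"], ["smoke"]), person(), person()]
R = [person(["fatigue"], ["smoke"]), person(), person(), person()]

rate_G = occurrence_rate(G, "fatigue")
rate_R = occurrence_rate(R, "fatigue")
cond_G = conditional_rate(G, "fatigue", "smoke")
cond_R = conditional_rate(R, "fatigue", "smoke")
```

In this toy data the ailment is twice as common in G as in R, yet the rates are equal once conditioned on the exposure, which is the pattern the second question looks for.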
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The lack of precise target syndrome definition encourages the development of multiple 
research strategies. As mentioned before, the directed query technique used by CCEP (CCEP, 
1996, pp. 17 - 49) has sliced the database from numerous different perspectives. What is needed 
is a search tool which can examine multiple combinations of independent (LHS) and dependent 
(RHS) variables and all possible values for each variable simultaneously. This adds an extra 
dimension to the analysis. Conventional data mining tools typically allow the user to specify a 
range of possible LHS variables for search and a single RHS variable. Multiple RHS fields may 
still be handled under this doctrine by creating a pseudo field which contains a different value for 
each unique combination of values in the RHS fields to be examined. However, if the RHS 
fields for analysis are large in number or cannot be specifically identified, the pseudo field 
coding becomes impractically large. What is needed instead is a data mining tool which can 
apply selective induction operators to a range of possible attributes (not just individual attribute 
and value instances) on the LHS and RHS simultaneously. 

This methodology is plausible and in fact was done by DaMI in this research, but it is 
prudent to note that this strategy will still produce an extremely large search space. For example, 
the first analysis done by DaMI examines the associations between 15 standard symptoms (LHS) 
and 21 possible diagnoses (RHS). All attributes are Boolean and are not limited in the number of 
simultaneous combinations (all symptoms and diagnoses could be simultaneously present or 
“true”). Therefore the possible search space is 2^36, or about 6.8 x 10^10, possible hypotheses. It is for this 
specific reason that we chose to use a genetic algorithm, with its ability to discriminately analyze 
tremendous search spaces. A test was conducted in which this particular problem was analyzed 
using simple “brute force” (test every possible combination indiscriminately), using a 486DX/66 
MHz personal computer. The personal computer was able to test about 600,000 combinations per 
day. At this rate, this one complete analysis would take 114,992 days (315 years). Even if a 
platform were chosen that was 100 times faster than our test personal computer, the analysis 
duration would be an unacceptable 3.15 years. 
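
The brute-force estimate can be checked with a few lines of arithmetic. The sketch assumes exactly 600,000 tests per day, whereas the text says “about” 600,000, so the result lands near, rather than exactly on, the 114,992 days quoted above.

```python
symptoms, diagnoses = 15, 21              # Boolean LHS and RHS attributes
search_space = 2 ** (symptoms + diagnoses)  # every on/off combination
tests_per_day = 600_000                   # assumed exact; the text says "about"

days = search_space / tests_per_day
years = days / 365
years_on_100x_machine = years / 100

print(f"{search_space:,} hypotheses, {days:,.0f} days, {years:.0f} years")
```

Even a hundredfold faster platform leaves an exhaustive search on the order of three years, which is why a selective search was preferred.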
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2: Database Content and Structure 


Several problems were encountered during the course of this research with the CCEP 
database content and structure. These problems fall into two major categories: data 
representation anomalies which make it difficult for an algorithm to extract meaningful 
information from the data, and data collection anomalies which introduce bias into the data being 
analyzed. Examples of data representation anomalies include irrelevant data and non-normalized 
data. These problems must be corrected before useful analysis can be conducted; they usually 
require modification of the database itself. In the case of CCEP, data collection anomalies 
include data that were self-reported by participants, self-referral of PGW veterans to the CCEP 
program, and lack of an established control group. Collection anomalies do not interfere with 
analysis itself, but they must be acknowledged or accounted for when examining results. 

Seventy-seven fields in the CCEP database are simply unusable. Many fields contain 
sensitive unclassified data on the participants (names, social security numbers, addresses, etc.) 
which is not helpful for medical research and is subject to the Privacy Act of 1974. Those fields 
were deleted at the outset. Another, larger group of fields is used by CCEP for administrative 
processing and is similarly not helpful to research. Finally, there were some fields that have 
been collected as non-standardized text. The most serious occurrence of this is the “chief 
complaint,” or in other words the reason that the participant approached CCEP for an 
examination. No standardization was enforced in this free-text field, so it is practically impossible 
for a computer to determine similarity between tuples, short of creating a complete index of chief 
complaint texts and some standard category indicator. This is fortunately not the case with 
diagnoses, which use the standard numeric ICD coding system. Participant complaint 
information was captured in the form of fifteen standard symptoms, but a coded chief complaint 
would prove most helpful. 

A key shortcoming of the database, reported at the outset by CCEP, is the large amount 
of data which are self-reported by participants. Self-reported data are those determined directly 
by responses from participants during their medical examinations (as opposed to 
clinical test results, review of documentation, or impartial third-party observation). Self-reported 
data are analogous to a survey, which is in and of itself not a database flaw. However, in the 
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context of CCEP, all exposure and standard symptom data are self-reported. This reduces the 
direct applicability of aggregate participant responses because perceived exposure may be 
distinctly different from actual exposure. This is most easily demonstrated by an example we 
call “the Botulism Illusion.” Within the CCEP database, 26.4% (4,500) of the active-duty 
participants report receiving the botulism vaccine. Now it is known from medical records that 
only 8,800 or 1.26% of the 697,000 PGW veterans were given this vaccine. This high 
percentage (26.4% of participants) would appear to suggest a possible relationship between the 
botulism vaccine and PGW medical ailments, until it is pointed out that 21.9% of the CCEP 
participants who were examined and deemed “healthy” (primary diagnosis of V65.5) also 
reported receiving the botulism vaccine. (See Figure #1) Problems concerning reported data 
may be compensated for by collecting and examining a “control group” of participants who do 
not have significant medical conditions; however, reported data should always be interpreted 
with some degree of caution. 
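
The size of the gap between reported and documented exposure can be worked out directly from the figures quoted above. The variable names are ours, and note that the 26.4% figure covers active-duty participants while the 1.26% figure covers all PGW veterans, so the ratio is only indicative.

```python
vaccinated, veterans = 8_800, 697_000
actual_rate = vaccinated / veterans   # share of all PGW veterans given the vaccine
reported_all = 0.264                  # CCEP participants reporting the vaccine
reported_healthy = 0.219              # "V65.5" (healthy) participants reporting it

over_report_factor = reported_all / actual_rate
gap = reported_all - reported_healthy

print(f"actual {actual_rate:.2%}, over-reporting factor {over_report_factor:.1f}x, "
      f"gap between all and healthy {gap:.1%}")
```

The reported rate exceeds the documented rate by roughly a factor of twenty, while ill and healthy participants differ by only a few percentage points, which is why the apparent association is called an illusion.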


[Figure 1 panels: “Reported by CCEP Participants” and “Reported by ‘V65.5’ Participants”] 

Figure 1. The Botulism Illusion 

Another obstacle to a meaningful analysis of the CCEP database is the self-referral of participants 
(each participant made a conscious decision to start the CCEP examination process). 
As described in Appendix A, any individual who was eligible for medical care under the MHSS 
system in 1994 and had a health concern related to PGW service (whether directly or indirectly) 
could request a full medical evaluation under the CCEP program. This encouraged a wide range 
of participants, but the self-referral of patients may invalidate the CCEP database as a statistical 
representation of PGW veterans as a whole. Had the participants in CCEP been selected 
randomly, then their aggregate response and demographic data could have been considered 
statistically representative. In this case, the sheer act of self-referral introduces some level of bias 
which, if it can be identified, should be explained to the degree possible. One possible solution 
is to randomly select a suitably large group of PGW veterans, regardless of health concerns, and 
provide them with the same medical evaluation as the other, self-referred, participants. In other 
words, create a control group. A control group will help identify bias from both self-reporting 
and self-referring. Unfortunately, this has not been adopted as part of the CCEP program. 
Suggestions have been made to create a control group after-the-fact, but a strong argument can be 
made that the passage of time since 1994 will introduce similar bias into the responses of a 
present-day control group. 

The reader should not infer that the CCEP database is a poor source; it has many strong 
points. After removal of unusable fields and reformatting other fields for enhanced analysis, 140 
“good” fields have remained for analysis. One of the most positive aspects of the database is the 
standardization of CCEP data collection. From the outset, CCEP used the same database 
structure, examination process, and coding scheme for all medical examinations. There are some 
exceptions, such as the case of chief complaint (mentioned above), but overall the data content is 
strongly coded and standardized. Any reader who has dealt with data analysis at all should 
appreciate the importance of a uniform database structure and coding system to computer 
analysis. Something as simple as representing an affirmative response as “Y” or “Yes” or “yes” 
can make computer-based query far more difficult. Of particular significance was the uniform 
usage of numeric ICD codes to represent outcome diagnoses. 
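
To illustrate why inconsistent affirmative codes complicate queries, the sketch below shows the kind of clean-up a non-standardized field would require. The accepted spellings and the function name `normalize_response` are assumptions; the CCEP database itself did not need this step because its coding was already uniform.

```python
AFFIRMATIVE = {"y", "yes", "true", "1"}
NEGATIVE = {"n", "no", "false", "0"}

def normalize_response(raw):
    """Map a free-form yes/no answer to "Y", "N", or None when unrecognized."""
    token = str(raw).strip().lower()
    if token in AFFIRMATIVE:
        return "Y"
    if token in NEGATIVE:
        return "N"
    return None  # leave ambiguous answers for manual review

cleaned = [normalize_response(v) for v in ["Y", "Yes", " yes ", "no", "maybe"]]
```

Every query against an uncleaned field would otherwise have to enumerate each spelling, and any spelling missed silently drops records.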
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3. Database Normalization 


The uniform coding scheme used in the CCEP database and limited need for scalar 
(continuous numerical) data sharply reduced the need for normalization (when used in a data 
mining context, “normalization” means structuring a database for effective computer analysis). 
The coding scheme used in the CCEP database is quite strong, so only a few modifications were 
made to normalize the database. Three significant modifications were made to the schema for 
analysis. Diagnoses were converted from single fields to multiple Boolean fields to facilitate 
analysis of diagnosis combinations. Standard symptoms were changed from durations to simple 
occurrence to remove the ambiguity of comparing duration categories. Finally, an aggregate 
reproductive disorder field was created to relate reported reproductive disorders of any type. 


a. Boolean representation of diagnoses 


The CCEP database captures outcome diagnoses assigned by the examining 
physician as a primary diagnosis and six secondary diagnoses. CCEP researchers assign a 
somewhat higher emphasis to the primary diagnosis, and place little weight on the ordering of 
secondary diagnoses. Therefore, a medical researcher would not differentiate between a 
diagnosis of fatigue appearing second or say fourth on a list of diagnoses attributed to a 
participant. A computer on the other hand could consider these distinctly different occurrences. 
Since combinations are central to this research, it is much easier to represent and analyze a 
string of diagnosis fields with Boolean (yes or no) values than a string of up to seven 
unordered diagnoses. However, 1700 different diagnoses were assigned to the 19,000+ CCEP 
participants, so a pure Boolean representation would be extremely unwieldy. We decided to 
represent the twenty-one most frequently occurring diagnoses as Boolean fields in addition to 
the existing ICD representation. The number twenty-one was selected arbitrarily (it can be 
expanded in future research), but at least one of the selected diagnoses is included in 74.7% of 
participant outcomes. See Figure #2 below. 

[Figure 2 diagram: panel “Original Diagnosis Representation”; remaining content not recoverable] 

Figure 2. Diagnosis Attribute Restructuring 
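
A minimal sketch of this restructuring follows. Only the code V65.5 appears in the text; the other codes, the three-code list standing in for the twenty-one tracked diagnoses, and the function name are illustrative assumptions.

```python
# Hypothetical stand-in for the list of frequently occurring diagnosis codes.
TOP_CODES = ["780.7", "311", "V65.5"]

def diagnoses_to_flags(primary, secondary, top_codes=TOP_CODES):
    """Return one "Y"/"N" flag per tracked code, ignoring list position."""
    assigned = {primary, *secondary}
    return {code: ("Y" if code in assigned else "N") for code in top_codes}

# The same three diagnoses recorded in a different order.
flags_a = diagnoses_to_flags("311", ["780.7", "719.4"])
flags_b = diagnoses_to_flags("719.4", ["311", "780.7"])
```

Both records yield identical flags, so a search engine no longer treats a diagnosis listed second as different from the same diagnosis listed fourth.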


b. Standard Symptoms 


In the CCEP database, participants are asked to report suffering from fifteen 
standard symptoms (e.g. chest pain, difficulty breathing, headaches). The responses are 
collected as dates of onset and durations. The date and duration are subjective (and subject to error) 
and, like diagnoses, difficult for an automated search engine to compare. A higher confidence can 
be assigned to a response if it is represented as a Boolean (the participant will in most cases 
accurately report existence of the symptoms, while his/her ability to estimate an onset and 
duration is questionable). Therefore, fifteen additional fields were added to the CCEP database, 
one corresponding to each symptom and equal to “Y” if the participant reported the symptom at 
any time for any non-zero duration. 
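
The conversion rule can be sketched as follows. The field names, the use of months as the duration unit, and the `_any` suffix are assumptions for illustration only.

```python
def symptom_flag(duration_months):
    """Return "Y" if the symptom was reported for any non-zero duration, else "N"."""
    reported = duration_months is not None and duration_months > 0
    return "Y" if reported else "N"

# Hypothetical participant record: reported duration per symptom (None = not reported).
record = {"chest_pain": 6, "headache": 0, "joint_pain": None}
flags = {name + "_any": symptom_flag(d) for name, d in record.items()}
```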


c. Reproductive Disorders 


One of the high-visibility aspects of the PGW is the possibility that a syndrome 
may be causing PGW participants to experience a higher rate of reproductive disorders 
(specifically birth defects). The CCEP database captures reproductive disorders (a participant may 
report a reproductive disorder actually experienced by a spouse or manifested in offspring) in five 
areas: 

• Infertility 
• Miscarriages 
• Still births 
• Infant deaths 
• Birth defects 

These five categories are further subdivided into disorders experienced prior to and after PGW 
service, making a total of 10 reproductive disorder fields. We cannot be certain that a syndrome, 
should it exist, would cause only one form of reproductive disorder. Therefore, two new fields 
were created to reflect any reproductive disorder experienced by the participant, either prior to or 
after the PGW conflict. In other words, if a participant reported infertility, a miscarriage, a still 
birth, an infant death, or a child with birth defects prior to PGW service, then the new field 
(PQ_prior) was set to “Y.” If none of these were experienced prior to PGW service, then 
PQ_prior was set to “N.” Similarly, if any of the five sub-categories were affirmatively answered 
after PGW service, then PQ_after was set to “Y.” This will allow the research to be more 
sensitive to associations between demographic, exposure, symptom, and diagnosis data and any 
combination of reproductive disorders. Naturally, any interesting associations developed 
concerning these two new fields will need to be re-categorized by medical researchers before a 
finding may be made. 
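
The aggregation rule for the two new fields can be sketched as below. The per-category field names are hypothetical, and the aggregate fields are written here as PQ_prior and PQ_after.

```python
CATEGORIES = ["infertility", "miscarriage", "still_birth", "infant_death", "birth_defect"]

def aggregate_flag(record, period):
    """Return "Y" if any of the five disorder fields for `period` is "Y", else "N"."""
    fields = (f"{category}_{period}" for category in CATEGORIES)
    return "Y" if any(record.get(field) == "Y" for field in fields) else "N"

# Hypothetical participant: one disorder reported after PGW service, none before.
record = {"miscarriage_after": "Y", "infertility_prior": "N"}
record["PQ_prior"] = aggregate_flag(record, "prior")
record["PQ_after"] = aggregate_flag(record, "after")
```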

After completion of normalization, 6 demographic, 32 reported exposure, 15 (Boolean) 
standard symptom, and 21 (Boolean) diagnosis fields are available for automated analysis. 
These 74 fields observe a uniform structure and coding scheme and are the foci of this research. 
Please consult Appendix A for a detailed list of analyzed fields. 
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4. What is “Interesting?” 


In Section II.D.1, we asked the question, “What is a syndrome?” It is necessary at this 
point to revisit this question, but from an automated analysis perspective. A genetic algorithm 
depends (as do many other techniques) on the ability of the researcher to define in quantitative 
terms what is “interesting?” The problem in many forms of decision science is not whether a 
model performs accurately, but rather if it improves the quality of a decision. In a genetic 
algorithm, selection of hypotheses to evaluate is proportionally related to a “fitness” value for 
each hypothesis, so it is critical that our “fitness function” accurately represents the interest of 
medical researchers. This characteristic is reflected in the fundamental genetic theory: 


“Roughly, the fitness of a phenotype is the number of its offspring which survive 
to reproduce...This measure rests upon a universal, and familiar, feature of 
biological systems: Every individual (phenotype) exists as a member of a 
population of similar individuals, a population constantly in flux because of the 
reproduction and death of the individuals comprising it. The fitness of an 
individual is clearly related to its influence upon the future development of the 
population. When many offspring of a given individual survive to reproduce, 
then many members of the resulting population, the “next generation,” will 
carry the alleles of that individual.” (Holland, 1975, p. 12) 

This returns us to the fundamental question: “What is interesting to CCEP medical researchers 
and how will that interest be manifested in the database?” In Section II.D.1, we stated that we 
are not sure whether a syndrome exists, and, if it does exist, we are not certain that the data 
captured in the CCEP database are appropriate to identify it. However, if these two uncertainties 
are removed, the following assertions can be made: 


• If there are one or more syndrome(s) affecting PGW veterans, the data to identify 
them may already exist in the CCEP database but are hidden by the sheer volume of 
data. 

• In this case, a syndrome will manifest itself as a single or unique group of diagnoses 
or symptoms shared by a cluster of participants sharing some common exposure 
and/or demographic attribute(s). 




By plunging directly into a search for associative relationships between risk factors and 
outcomes, we bypass a fundamental step in classical epidemiological technique. Normally, 
epidemiologists will first define the outcome diagnoses and/or symptomatology which describe a 
prospective syndrome. Once the definition is made, then research efforts are focused on 
associations with risk factors and other exposure sources. Unfortunately, the present research is 
left with a less than optimal situation. We suggest that a promising use for a genetic algorithm is 
to give clues to medical researchers that help them define a syndrome. 

In this research, we have accepted that conventional research methods alone may not be 
able to define and isolate a syndrome affecting PGW veterans. We are now led to re-examine the 
problem from different perspectives. Our research approach has been guided by the following 
ideas: 


• We are not trying to create an analysis that will isolate a single pre-defined Desert 
Storm Syndrome. Instead we are defining a profile that a syndrome might follow, 
should it exist. Our goal is to determine how a possible syndrome would be 
reflected in the data, as discriminately as possible, and then construct a fitness 
function which is appropriately high when this profile is met. 

• Our genetic algorithm does not find a Desert Storm Syndrome, but rather distills the 
billions of possible hypotheses into a set of hundreds. Not all hypotheses in the candidate 
set are syndromes, but if one or more syndromes do exist, they will be 
found in the candidate set. This smaller set of candidate hypotheses may realistically 
be examined more exhaustively by medical researchers and other conventional 
means. 

• By implementing the genetic algorithm as a precursor to medical research (and 
relieving it of the expectation that it must find “the answer”), we allow the genetic algorithm to 
significantly reduce the burden on the relatively scarce medical research assets at a 
relatively small cost to the organization. In more basic terms, the secret to operating 
genetic algorithms in an imperfect world is to allow them to do the first 80% of the 
analysis work with only 20% of the research cost. 
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With the question of “interest” now bounded, a proper fitness function may now be 
pursued. If a true syndrome does exist, then it is “caused” by something. Therefore, the 
participants will share some finite set of exposure mediums, or in other words all participants 
with a syndrome will share some commonality in exposure. This must be caveated by saying 
that the CCEP database may or may not contain the demographic and exposure elements to 
identify that commonality of exposure. But as our research mindset states, we are only 
attempting to establish the profile of a syndrome if it exists, and if the data necessary to identify 
it is contained in the CCEP database. If the prior statement is true, then there will be a relatively 
strong association between a finite set of exposure/demographic attributes and a unique 
combination of outcome diagnoses. Likewise, there will be a strong association between a finite 
set of exposure/demographic attributes and a specific combination of standard symptoms. The 
intersection between diagnoses and symptom combinations with similar exposure associations 
will profile a candidate syndrome. See Figure #3 below. 


[Figure 3 diagram: panels “Standard Symptoms” and “Outcome Diagnoses” linked to “Reported Exposures/Demographics”; labeled nodes include difficulty breathing, memory loss, joint pain, depleted uranium, botulism vaccine, gender male, gender female, alarm, and scud attack.] 

Analysis run #1 identifies high association between joint pain and hair loss, and botulism vaccine, depleted uranium and 
male participants. 

Analysis run #2 identifies high association between memory loss and fatigue diagnoses, and botulism vaccine, 
depleted uranium and male participants. 

Figure 3. Hypothesized Syndrome Profile 
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Now our question of “what is interesting?” can be defined. “Interesting” is combinations 
of RHS attributes (dependent variables) which are highly dependent on combinations of LHS 
attributes (independent variables), or in other words, the candidate dependent variables are truly 
determined by (not independent of) the candidate independent variables. The fitness function 
used must be such that hypotheses which demonstrate this property will be assigned a relatively 
high fitness value. There are numerous accepted functions in statistical literature that fit this 
requirement. Several of these are discussed in the next section. 


a. Conventional Epidemiological Measures 


A great deal of literature already exists, like (Goldberg, 1989) and (Holland, 
1975), to support the idea that genetic algorithms are quite successful at adaptively improving the 
quality of tested rules to suit the provided fitness function. From the outset, our genetic 
algorithm demonstrated this quality. However, the greatest challenge has been to ensure that the 
search model adequately represents the research questions (i.e. the genetic algorithm is doing 
what it was told to do, but have we provided it with relevant, meaningful instructions?). As a 
starting point for development of the fitness measure for this research, we first turned to classical 
epidemiology literature. 

Classical epidemiology evaluates any test in terms of four variables (see Figure #4 
below) which describe how successfully a test predicts the actual presence (or lack) of a specified 
disease. This is much akin to our own research which attempts to identify the success of a single 
or multiple exposure and/or risk factor attributes predicting a combination of symptoms or 
clinical diagnoses. In epidemiology, these four variables {a, b, c, d} are computed using a 
two-by-two matrix of test results and actual disease presence. 

                  Disease present        Disease absent 
Test positive     a (True Positive)      b (False Positive) 
Test negative     c (False Negative)     d (True Negative) 

                  Sensitivity            Specificity 
                  a/(a+c)                d/(b+d) 

Figure 4. Classical Epidemiological Measures 


By mathematically manipulating these four variables, four “quality” values are obtained from the 
relationship between the subject test and subject disease. In each case, keep in mind that our 
research is applying the risk/exposure as a test for (or indicator of) a specific symptom and/or 
diagnosis profile. These quality values are (Fletcher, 1982, pp. 43 - 57): 


• Positive Predictive Value. Indicates the ability of a positive test result to accurately 
identify the presence of a disease in a patient. This term is similar to “confidence” used 
as a fitness measure in many data mining tools. We term this “forward confidence.” 

$$PV(+) = \frac{a}{a+b}$$

• Negative Predictive Value. Indicates the ability of a negative test result to accurately 
determine the absence of a disease in a patient. Most data mining tools do not consider 
this measure, but recommend the analysis be run with swapped dependent and 
independent variables. This is not practical if multiple dependent variables are being 
analyzed. 

$$PV(-) = \frac{d}{c+d}$$

• Sensitivity. The proportion of subjects with a disease who have a positive test for the 
disease. A sensitive test will rarely miss people with the disease. 

$$\text{sensitivity} = \frac{a}{a+c}$$

• Specificity. The proportion of subjects without the disease who have a negative test. A 
specific test will rarely misclassify people without the disease as diseased. 

$$\text{specificity} = \frac{d}{b+d}$$
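
The four quality values can be computed directly from the two-by-two counts. The sketch below uses made-up counts and a hypothetical function name; it simply restates the four formulas above.

```python
def quality_values(a, b, c, d):
    """Four classical measures from a 2x2 table.
    a = true positives, b = false positives, c = false negatives, d = true negatives."""
    return {
        "pv_pos": a / (a + b),        # positive predictive value ("forward confidence")
        "pv_neg": d / (c + d),        # negative predictive value
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
    }

# Illustrative counts only, not taken from the CCEP database.
q = quality_values(a=90, b=10, c=30, d=70)
```

Here the exposure (acting as the “test”) is a strong forward predictor (PV(+) of 0.9) but misses a quarter of the cases (sensitivity of 0.75), which shows why no single measure is sufficient.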


b. Fitness Measure Paradoxes 


In our research, classical epidemiology measures are helpful in choosing a 
suitable fitness function, but no single aforementioned measure is sufficient for several reasons. 
Rather we desire an aggregate fitness measure which will increase in response to any classic 
measure of interest. Fundamentally, this research problem differs from clinical test evaluation in 
one respect. While a high number of either false positive (b) or false negative (c) tests is a 
counter-indication of a test's quality, it is also desirable (in our case) if a risk/exposure 
combination is contraindicative of an outcome symptom/diagnosis set. In certain cases, a true 
positive may mean nothing because there are also many false positives. In other cases, a 
simultaneously high false positive and false negative is quite informative. This is best described 
by an example (Figure #5), but basically, in the case of CCEP database analysis, we are most 
interested in the hypotheses having highest values and lowest values of sensitivity and 
specificity. 

• Consider the simplest hypothesis: 1 LHS (L) and 1 RHS (R) field. 
• If L and R are Boolean, there are four possible hypotheses to test. 
• We are looking for more than just a high prob(R=“yes”|L=“yes”). 

INTERESTING                              NOT INTERESTING 
IF L = “yes” THEN R = “yes”  90%         IF L = “yes” THEN R = “yes”  10% 
IF L = “yes” THEN R = “no”   10%         IF L = “yes” THEN R = “no”   90% 
IF L = “no”  THEN R = “no”   80%         IF L = “no”  THEN R = “no”   80% 
IF L = “no”  THEN R = “yes”  20%         IF L = “no”  THEN R = “yes”  20% 

• As the number of fields and/or values per field increases, the 
problem expands exponentially. 

Figure 5. Attribute Value Relationships 
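
The contrast between the two columns of Figure 5 can be expressed as how much the outcome rate shifts when L changes. The gap measure below is our own simplification for illustration and is not the fitness function used in this research.

```python
def dependence(p_yes_given_yes, p_yes_given_no):
    """Absolute gap between P(R=yes | L=yes) and P(R=yes | L=no)."""
    return abs(p_yes_given_yes - p_yes_given_no)

# Values taken from the two columns of Figure 5.
interesting = dependence(0.90, 0.20)
not_interesting = dependence(0.10, 0.20)
```

In the “interesting” column knowing L changes the chance of R by about 70 percentage points, while in the other column it changes it by about 10, so R is nearly independent of L there.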


c. Alternative Fitness Measures 


Now that our concept of “interesting” has been framed from the epidemiological 
perspective, we can set about the task of selecting a single fitness measure which mathematically 
describes our concept of interest to the genetic algorithm. Again, there is some challenge in this 
because there are several different measures of interest to medical researchers (discussed in the 
previous section), yet the genetic algorithm requires a single aggregate fitness measure. The 
genetic algorithm could be run several times using different fitness measures, but this carries a 
high cost in both processing time and post-processing analysis effort. Likewise, we have seen 
from the preceding section that reliance on any single measure carries with it the possibility of 
statistical misinterpretation. Two paths were examined in this research to address this problem, 
although we note that there may be many other possible solutions. 


• Modified J-measure. Refer again to Figure #4 and the four test characteristics 
[PV(+), PV(-), sensitivity, and specificity]. Our first approach was to create a 
measure which was suitably large when any of these four measures was large and 
suitably low when none of the measures was relatively large, in effect an aggregate 
fitness measure. It should be noticed from the foundation we have laid that if both a 
and d are relatively large when compared with b and c, the four test characteristics 
are all relatively large. This would demonstrate that the risk factors and/or exposures 
under investigation are highly successful in predicting the outcome symptoms and/or 
diagnoses under investigation. Tentatively we will select the following formula as 
our fitness measure: 

$$\text{mod } j(\text{fitness}) = \frac{a \times d}{b \times c}$$

It may also be noticed that this measure will effectively indicate if the outcome 
symptoms/diagnoses are successful at predicting the risk/exposures. We call this 
property “reverse confidence.” It is particularly helpful to examine the two sets of 
attributes with each assuming the role of dependent and independent variables 
simultaneously. Finally, recall that unlike the evaluation of clinical tests, CCEP 
analysts consider it interesting if both false positive and false negative values are 
simultaneously high (indicating a risk/exposure combination reduces the probability 
of a symptom/diagnosis combination). To account for this situation, our j-measure is 
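modified as follows: 

$$\text{mod } j(\text{fitness}) = \begin{cases} \dfrac{a \times d}{b \times c} & \text{if } a \times d \ge b \times c \\[1ex] \dfrac{b \times c}{a \times d} & \text{if } b \times c > a \times d \end{cases}$$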


(Figure #6 gives an example of a modified j-measure calculation; note we use a 
natural log function to shape the fitness function for better genetic competition; this 
will be discussed in Chapter V): 
mod j-measure = 1 + ln[(a*d)/(b*c)] 
              = 1 + ln[(11*7505)/(84*146)] = 2.91 

                             Fatigue 
                         “yes”      “no” 
Uranium      “yes”        11         84       PV(+) = 11/(11+84) = 11.6% 
Exposure     “no”        146       7505       PV(-) = 7505/(146+7505) = 98.1% 

Sensitivity = 11/(11+146) = 7.0%       Specificity = 7505/(84+7505) = 98.9% 

Figure 6. Modified J-measure Calculations 
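
The arithmetic in Figure 6 can be retraced with a short Python sketch. It assumes the cell labels used in the text (a = true positives, b = false positives, c = false negatives, d = true negatives) and reads the modification as inverting the cross-product ratio whenever b x c exceeds a x d. The function names are illustrative and are not part of DaMI, and the handling of empty cells is an assumption because the text does not address it.

```python
import math

def screening_measures(a, b, c, d):
    """Four classic test characteristics from a 2x2 contingency table.

    a = true positives, b = false positives,
    c = false negatives, d = true negatives.
    """
    return {
        "pv_pos": a / (a + b),
        "pv_neg": d / (c + d),
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
    }

def mod_j_measure(a, b, c, d):
    """1 + ln of the cross-product ratio, inverted when it falls below 1.

    Assumption: the inversion is how "interesting" contraindications
    (high b and c) are rewarded; empty cells are rejected outright.
    """
    if a * d == 0 or b * c == 0:
        raise ValueError("cross-products must be non-zero in this sketch")
    ratio = (a * d) / (b * c)
    if ratio < 1:
        ratio = 1 / ratio
    return 1 + math.log(ratio)

# Uranium exposure versus fatigue, using the counts shown in Figure 6.
stats = screening_measures(11, 84, 146, 7505)
fitness = mod_j_measure(11, 84, 146, 7505)
print({k: round(v * 100, 1) for k, v in stats.items()}, round(fitness, 2))
```

Swapping the roles of the two diagonals (for example 84, 11, 7505, 146) yields the same fitness, which is the symmetric behavior the modification is meant to provide.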


• Chi-square. Another approach to the question of fitness function may be derived 
strictly from statistics. Since our aim is to identify risk factors and/or exposures that 
are highly associated with symptom and/or diagnosis groups, we may use a 
statistical principle which measures the independence (not the same as the term 
“independent variable” used in knowledge discovery science to denote the LHS 
variables) of two groups of attributes. According to Walpole, et al., “The chi-square 
test procedure...can also be used to test the hypothesis of the independence of two 
variables of classification.” (Walpole, et al., 1988, pp. 343 - 346) The same 
“contingency table” used by epidemiologists may be constructed and used to 
compute expected levels of a, b, c, and d based on the joint probability function of 
the dependent and independent variables. (See Figure #7) Observed values are the 
original values of a, b, c, and d, and expected values are calculated using the 
following formula: 


$$\text{Estimated Expected Value} = \frac{(\text{column total}) \times (\text{row total})}{\text{grand total}}$$
The chi-square is now calculated and summed for all cells in the matrix. (Chi-square 
may be used for a matrix of any size; in this case a two-by-two matrix was used for 
simplicity. Since a two-by-two matrix is used in the example, the formula below contains 
the Yates Correction, which is not necessary in larger matrices.) A higher chi-square 
indicates a higher level of dependence (or lack of independence) between the two 
attribute sets. The chi-square formula (with Yates correction) follows; example chi-square 
calculations are included in Figure #7: 
chi-square(tot) = 39.32 

Figure 7. Chi-square Calculations 


The modified j-measure has been used by this research to date; however, a new statistical analysis 
package designed to analyze using chi-square is currently being constructed. A more straightforward 
formula for chi-square will actually be used in the new statistical analysis package 
(Dixon and Massey, 1969, pp. 242 - 243): 


$$\chi^2 = \frac{\left(|ad - bc| - \tfrac{N}{2}\right)^2 N}{(a + b)(a + c)(b + d)(c + d)}$$
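
As a hedged illustration, the Python sketch below computes a Yates-corrected chi-square for a two-by-two table in two ways: cell by cell from observed and expected counts, and with the single-expression shortcut for two-by-two tables. The cell-by-cell form subtracts 0.5 from each absolute deviation, which is the standard textbook continuity correction and is assumed here to be what the thesis intends. The function names are illustrative only.

```python
def chi_square_yates(a, b, c, d):
    """Sum (|observed - expected| - 0.5)^2 / expected over the four cells."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    total = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            total += (abs(observed[i][j] - expected) - 0.5) ** 2 / expected
    return total

def chi_square_shortcut(a, b, c, d):
    """Closed form for a 2x2 table; algebraically equal to the cell sum."""
    n = a + b + c + d
    numerator = n * (abs(a * d - b * c) - n / 2) ** 2
    return numerator / ((a + b) * (a + c) * (b + d) * (c + d))

# Illustrative counts only (the uranium/fatigue table from Figure 6).
cell_sum = chi_square_yates(11, 84, 146, 7505)
shortcut = chi_square_shortcut(11, 84, 146, 7505)
print(round(cell_sum, 2), round(shortcut, 2))
```

The two functions agree because every cell of a two-by-two table has the same absolute deviation from its expected value, so the shortcut avoids building the expected-value matrix for each query.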
III. SOLUTION CONCEPTS 


A. RESEARCH GOALS 


In the case of the Desert Storm research, years of conventional medical research have 
yielded no single syndrome or associated symptomatology set. This means that no fixed 
dependent variable set (combinations of diagnoses and/or reported standard symptoms) can be 
readily identified. The traditional epidemiological paradigm is to isolate a group of individuals 
with consistent symptoms/outcome diagnoses and then find what key demographic or exposure 
elements these individuals share. If related demographic/exposure data are present, they are used to 
focus clinical research on an underlying cause. This approach has not proven fruitful to date, 
either because no syndrome exists or because the sheer volume of data in the CCEP database 
hides a relation of interest from human-controlled querying. Therefore, we have chosen to let 
technology simplify the problem from the outset of the knowledge discovery process. 

As mentioned before, there are four basic categories of useful data contained in the 
CCEP database (demographics, reported exposures, reported standard symptoms, and outcome 
diagnoses). While attributes in each category could prove useful as independent (LHS) or 
dependent (RHS) variables, it is doubtful that attributes from the same category will be useful as 
both LHS and RHS simultaneously. The research question is now simplified to an examination 
of which attributes (or combinations of attributes) in each category are most highly associated 
with (or statistically dependent on) which attributes from another major data category. 


EXAMPLE: What associative relationships exist between exposure attributes and 
outcome diagnosis attributes? Based on analysis, there is a high association between 
reported exposure to Scud Attack and Depleted Uranium and an outcome diagnosis of 
Post-traumatic Stress Disorder. [This is just an example, not an actual finding] 


This exponentially increases the size of the prospective search space, which is represented by 
$2^{\#LHS} \times 2^{\#RHS}$ (where #LHS = number of independent fields and #RHS = number of dependent 
fields and all attributes are Boolean; if not, the search space is even greater). The increase in 
search space can provide useful insight to medical researchers as they develop hypotheses. 
Instead of waiting for medical researchers to provide a more structured problem (and thereby 
reduce the search space), it was our feeling that an intelligent search technique could be 
employed effectively in the problem as given. Therefore, the role of our genetic algorithm is to 
test an extremely large subset of all fields in the CCEP database concurrently for levels of 
interest based on a specific model of epidemiological interest, to wit: 


$$\Theta(LHS^*, RHS^*) = \max\big(\Theta(LHS', RHS')\big)$$
where $LHS' \subset LHS^*$ and $RHS' \subset RHS^*$ and $\Theta()$ = fitness function 
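
To make the size of this search concrete, the following Python sketch counts the attribute-level search space for Boolean fields and performs the exhaustive maximization that the genetic algorithm is meant to avoid. The field names and the toy fitness function are invented for illustration; they are not CCEP attributes or findings.

```python
from itertools import combinations

def search_space_size(n_lhs, n_rhs):
    """Number of attribute subsets pairs: 2^#LHS x 2^#RHS."""
    return 2 ** n_lhs * 2 ** n_rhs

def nonempty_subsets(fields):
    """Yield every non-empty subset of fields as a tuple."""
    for k in range(1, len(fields) + 1):
        yield from combinations(fields, k)

def best_hypothesis(lhs_fields, rhs_fields, fitness):
    """Exhaustively return the (LHS', RHS') pair with maximal fitness."""
    candidates = (
        (lhs, rhs)
        for lhs in nonempty_subsets(lhs_fields)
        for rhs in nonempty_subsets(rhs_fields)
    )
    return max(candidates, key=lambda pair: fitness(*pair))

def toy_fitness(lhs, rhs):
    """Hypothetical score: rewards one planted pair, penalizes size."""
    score = ("exposure_x" in lhs) + ("diagnosis_y" in rhs)
    return score - 0.1 * (len(lhs) + len(rhs))

best = best_hypothesis(
    ["exposure_x", "exposure_z"], ["diagnosis_y", "diagnosis_w"], toy_fitness
)
print(search_space_size(10, 10), best)
```

With only ten fields on each side the count already exceeds one million attribute-level hypotheses, before any attribute values are enumerated, which is why exhaustive enumeration is only workable for toy cases such as this one.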


We did count on CCEP medical researchers to define their concept of "interesting" and 
thereby guide our selection of an appropriate fitness function. This fundamental shift in 
knowledge discovery technique suggests that a genetic algorithm may be used to provide 
researchers with information to assist them in framing the initial research strategy, instead of 
framing the problem and then passing it to a genetic algorithm. We asked the following question, 
"If a syndrome does exist and the data necessary to identify it are contained in the CCEP 
database, what data relationships would it create in the CCEP database?" The answer to this was 
converted to a mathematical fitness measure. The resulting combinations of 
exposures/demographics and symptoms/diagnoses discovered will contain any identifiable 
syndromes, but the entire set of hypotheses is not guaranteed to consist of useful solutions. The 
goal is to present medical researchers with a more workable solution space in which to focus 
their conventional research efforts. This approach shifts the burden of searching a tremendous 
alternative space appropriately onto the genetic algorithm. 
B. SOLUTION STRATEGY 


Our solution strategy takes two forms, theoretical and practical. In the theoretical sense, 
the solution strategy rests on selection of the most efficient method of searching an extremely 
large solution space. There are three basic methods of search: 


Random. In this type of search, a computer program will randomly generate 
hypotheses and pass these hypotheses to an evaluating routine. The evaluating 
routine assigns a fitness measure to each hypothesis based on the fitness function 
provided. If the hypotheses are generated sequentially, this method is also known as 
“brute force.” This method tests many hypotheses, because the hypothesis 
generation apparatus is extremely simple, but has no capacity to self-improve or tune 
the search to the operator’s goals. 

Human-controlled Selective Search. In this case, a human formulates a hypothesis 
and translates it into the form of a query. The query is evaluated by the computer 
system and the results are returned to the human operator. It is assumed that the 
human operator draws upon practical knowledge of the problem and the results of 
prior queries to formulate new queries. Therefore, the quality of query formulation 
improves throughout the process. This allows the search to self-improve (including 
the human operator within the boundary of the search system) and obviously tune to 
the operator’s goals. However, the hypothesis generation is extremely slow. 

Systematic, Intelligent, Automated Search. A computer program (genetic 
algorithm) generates hypotheses, passes them to an automated evaluator, receives 
results, and then re-generates a new set of hypotheses (systematically adapting its 
search based on its past performance as indicated in the results received). This 
technique demonstrates all three desirable search characteristics: fast hypothesis 
generation, self-improvement, and tuning to the operator's goals. 

Figure #8 illustrates the comparative advantages of each search technique. It should now be 
clear, from a theoretical point of view, why a systematic, intelligent, automated search 
(genetic algorithm) has been chosen. 


[Figure 8 compares Random Search, Human-controlled Selective Search, and the Genetic Algorithm on three axes: number of hypotheses generated (“search speed”), systematic adaptation (“self-improves”), and intelligent selection (“search tuned to user goals”).] 

Figure 8. Characteristic of Different Search Techniques 


Now let us discuss the solution strategy on a more practical level. Assume for a moment that a 
genetic algorithm performs a systematic, intelligent search as theorized. The next section will 
provide a theoretical basis for this assumption. From Section II.D.4, we draw the premise that a 
syndrome will manifest itself as a high association between a specific combination of 
demographic and/or exposure attributes and a finite set of symptomatology or diagnoses. 
Combine this with the premise that either a modified j-measure or chi-square formula will indicate 
the level of association (or dependence) between two sets of attributes. Our strategy is then to 
instruct the genetic algorithm (DaMI) to find the most significant associations between 
demographics/exposures and symptoms and between demographics/exposures and diagnoses. 
These two analyses will divide the complete set of possible combinations of 
demographics/exposures into three categories (note that demographics/exposures are traditionally 
viewed as the independent attribute set): 

Demographic/Exposure combinations which appear on neither analysis. Any 
hypothesis not contained on either study indicates that there is no statistical basis 
within the CCEP database to indicate that combination is a possible syndrome. This 
does not mean that it could not suggest a syndrome; as stated before, the CCEP 
database may not capture the appropriate data to identify the hypothesis as a 
syndrome. 

Demographic/Exposure combinations which are associated with both specific 
combinations of symptoms and specific combinations of diagnoses. This is the 
ideal case for suggesting the existence of a syndrome. It indicates that a group of 
PGW participants, sharing both a common symptomatology and outcome diagnosis 
set, belong to the demographic profile and/or report common exposure elements. 
Clinical research should be directed toward a prospective syndrome demonstrating 
the listed symptoms and diagnoses. Again this indicates that a hypothesis meets the 
mathematical definition of interesting, but the possibility of it being a syndrome can 
only be confirmed by evaluation by medical professionals. 

Demographic/Exposure combinations which are associated with either specific 
combinations of symptoms or diagnoses. A majority of hypotheses identified by 
DaMI will fall into this category. If only one correlation is made with the 
demographic/exposure data, there is a weaker indication that this particular 
combination signals a candidate syndrome. However, failure to appear on both 
analyses should not completely discount the hypothesis. As mentioned before, the 
failure of the CCEP database to capture all symptomatology or diagnoses may 
explain the appearance of the demographic/exposure combination on only one 
analysis. Therefore, hypotheses in this category should still be evaluated by medical 
professionals. 
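
The three-way division described above amounts to simple set arithmetic over the results of the two analyses. The Python sketch below is one way to express it, assuming each demographic/exposure combination is represented as a frozenset of attribute names; the attribute names shown are placeholders, not findings.

```python
def categorize(all_combos, symptom_hits, diagnosis_hits):
    """Split demographic/exposure combinations into the three categories.

    symptom_hits / diagnosis_hits: combinations reported as interesting
    by the symptom analysis and the diagnosis analysis respectively.
    """
    both = symptom_hits & diagnosis_hits
    one_only = (symptom_hits | diagnosis_hits) - both
    neither = all_combos - symptom_hits - diagnosis_hits
    return {"neither": neither, "both": both, "one_only": one_only}

# Placeholder attribute names, for illustration only.
combos = {
    frozenset({"exposure_a"}),
    frozenset({"exposure_b"}),
    frozenset({"exposure_a", "exposure_b"}),
    frozenset({"demographic_c"}),
}
symptom_hits = {frozenset({"exposure_a"}), frozenset({"exposure_a", "exposure_b"})}
diagnosis_hits = {frozenset({"exposure_a", "exposure_b"}), frozenset({"demographic_c"})}

categories = categorize(combos, symptom_hits, diagnosis_hits)
for name, members in categories.items():
    print(name, sorted(sorted(m) for m in members))
```

In practice the two hit sets would come from filtered results of two discovery sessions, so the membership of each category depends on the fitness threshold chosen during filtering.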


Naturally, a certain degree of ambiguity exists concerning the specific fitness measurement 
thresholds with respect to interest (filtering). Filtering will be discussed in Chapter VI. But in a 
practical sense, this analysis will provide medical researchers with a prioritized list of interesting 
associations. The central point is that most possible hypotheses will prove statistically 
implausible and therefore fall into the first category, suggesting they not receive costly 
conventional medical research efforts. 

Finally, many initial DaMI discovery sessions were devoted to analyzing relationships 
between reported symptoms and outcome diagnoses. Early input from CCEP epidemiologists 
included a strong desire to identify unexpected symptom/diagnosis combinations. This study 
was appealing for initial research because all attributes involved were Boolean (as opposed to 
demographic and exposure attributes having more than two possible values). The research 
proved statistically successful (discussed in Chapter VI) but of limited practical value to CCEP. 
IV. DaMI GENETIC ALGORITHM ARCHITECTURE 


Up to this point, this thesis has focused on the theoretical structuring of the CCEP 
research problem and formulating the qualities of a genetic algorithm required to solve the 
problem. The second half of this thesis will focus on describing the tool developed to meet these 
challenges and the success of that tool in actual analysis. Based on the preceding discussion, the 
genetic algorithm must be specifically designed: 


• to accept an unstructured set of dependent and independent variables 

• to efficiently search an extremely large search space 

• to employ adaptive learning, where a priori information is used to guide future 
hypothesis testing 


This chapter will deal with DaMI from a macro systems perspective; Chapter V will address the 
details of the system's design. 


A. PROGRAM MODULES 


Unlike many other genetic algorithms, the system designed for this research (DaMI) has 
been built using several independent modules. These modules consist of the genetic algorithm itself, a 
statistical package, a user interface, and a verification package. There were two primary reasons 
for this design strategy. The first was to relieve the genetic algorithm of the mundane analysis 
tasks, results filtering, and user interface tasks, thereby enhancing the space searching efficiency. 
The second reason was to aid in system development. By adopting a modular development 
approach, a great deal of effort could be focused on the core genetic algorithm technology, and 
the system could begin rapid prototyping before optimal statistical analysis and user interface 
modules were developed. Once the core genetic algorithm is properly functioning, more robust 
statistical engines and user options may be added, using experience gained from test runs. A 
more in-depth explanation of the genetic algorithm (GA) operation is contained in the next 
chapter. Figure #9 shows the relationship between the DaMI modules. 

Figure 9. Relationship of DaMI Modules 


1. The Genetic Algorithm Package 


The genetic algorithm package is responsible for maintaining a list (population) of 
hypotheses (rules) in the current generation, selecting the most successful rules, and performing 
the genetic operations of reproduction, crossover, and mutation. These genetic operators allow 
the system to adapt the analysis to the goal model (fitness function) and improve the search 
hypotheses as each generation is processed. In this thesis, “hypothesis” and “rule” are used 
interchangeably; "hypothesis" is a medical research term and "rule" is an artificial intelligence 
term. Clearly, not all possible hypotheses will be tested (hence the advantage of the genetic 
algorithm), but the use of genetic operators ensures that the rules being tested have the highest 
probability of satisfying the given fitness function (Holland, 1975). In the DaMI system, the 
genetic algorithm stores hypotheses as combinations of attributes only, not as combinations of 
attributes and specific values. Competition is based on success of attribute sets as a whole. 
Attribute sets (like gender, receiving the botulism vaccine, exposure to uranium [independent 
variables] and Depression and Chronic Fatigue Syndrome [dependent variables]) are passed to 
the statistical package, which returns an aggregate fitness value for all possible value 
combinations of those attributes. The statistical package is called recursively during the 
processing of a single generation for every rule, until the entire generation is evaluated. Then the 
genetic algorithm produces the next generation and the process is repeated. 


2. The Statistical Analysis Package 


The statistical analysis package receives a set of independent and dependent attributes to 
evaluate from the genetic algorithm package. The statistical package requires no information 
other than a list of field names to evaluate. The number of attributes in each request sent to the 
statistical package varies, so it must be capable of processing loosely bounded problems. 
During pre-processing, the analysis database (database under analysis; in this case the CCEP 
Persian Gulf War Database) is examined and a table is created of all attributes and their possible 
values. This table is used as the source for generating each individual query (there are many 
individual queries generated to answer each request from the genetic algorithm) and ensuring that 
each possible combination is tested, but only once. The statistical package then computes the 
fitness of each possible attribute/value combination. An aggregate fitness measure is then 
computed and returned to the genetic algorithm package. As the statistical package tests 
attributes against the database under analysis, it also performs a test of each attribute/value 
combination against a second database. This second test is not returned to the genetic algorithm 
and therefore does not affect hypothesis competition. This value is stored to be used later for 
results validation (see section V.C). 

3. User Interface 


The user interface controls interaction between DaMI and the system operator. The user 
interface allows the user to adjust tunable parameters (discussed in Chapter V), view the 
discovery database at various stages of processing, and start and reset the genetic algorithm 
package. The user interface also provides intermediate feedback to the user during DaMI 
operation. It was designed using the Foxpro Screen Design Wizard and is controlled by push 
buttons and pop-up menus. Settings may not be adjusted “on-the-fly” when the genetic 
algorithm is operating. An example of the user-interface screen is shown in Figure #10 below. 
The user-interface module is disposable, and therefore an in-depth discussion of the 
user-interface design is not included in this thesis. 


[Figure 10 is a screen capture of the DaMI user interface; the legible fields are Population Size, Number of Generations, Crossover Probability, and Mutation Probability.] 

Figure 10. DaMI User Interface 

B. REPORTING AND FILTERING 


Once a discovery session has been completed by DaMI, several files are created. A 
transcript of each hypothesis individual (at the attribute level) of every generation is created as 
DaMI operates, along with a transaction record of each genetic operation employed, the source 
(parent) rules, and resulting offspring. The transaction record also maintains a time stamp at the 
start of each generation which can be used to monitor processing speed. DaMI also records how 
many actual combinations were tried during the session. These files will not be discussed in 
detail (file structures are contained in Appendix B). 

The most important file created (rulelib.dbf) contains a list of every hypothesis tested and 
used to determine an aggregate fitness measure (without duplication). Several key points must 
be cleared up at this juncture. First, not every possible attribute/value combination is used to 
compute the aggregate fitness value of a given attribute set (this is a tunable parameter). Second, 
rulelib.dbf stores attribute and value combinations (as opposed to the session transcript, which 
records only the higher-level attribute sets). It also contains the intermediate, final, and 
verification fitness measures. This makes rulelib.dbf the actual answer produced by DaMI. 
Figure #11 is an excerpt from rulelib.dbf. 

Figure 11. Rulelib.dbf Display 


Finally, whatever fitness measure is used will probably not have an arbitrary threshold of 
"interest." A fitness measure is only useful in ranking the relative interest of hypotheses tested; 
therefore some form of filtering will be done prior to reporting. However, it is inadvisable to 
enforce that filter during operation. Instead, rulelib.dbf is left in the most robust 
(non-summarized) form practical; filtering is performed ad hoc using an SQL-type query language on 
a case-by-case basis for each report. 
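
As a rough illustration of post-hoc filtering, the Python sketch below loads a few made-up rows into an in-memory SQLite table and selects the highest-ranked hypotheses above a chosen threshold. The table layout, column names, and values are assumptions for the example; the actual rulelib.dbf structure is the one given in Appendix B, and DaMI itself uses Foxpro rather than SQLite.

```python
import sqlite3

def top_rules(rows, min_fitness, limit):
    """Return (hypothesis, fitness) pairs at or above min_fitness, best first."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical columns standing in for the real rulelib.dbf fields.
    conn.execute(
        "CREATE TABLE rulelib (hypothesis TEXT, fitness REAL, verification REAL)"
    )
    conn.executemany("INSERT INTO rulelib VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT hypothesis, fitness FROM rulelib "
        "WHERE fitness >= ? ORDER BY fitness DESC LIMIT ?",
        (min_fitness, limit),
    ).fetchall()
    conn.close()
    return result

# Made-up rows: (hypothesis text, fitness, verification fitness).
sample_rows = [
    ("IF exposure_a = 'yes' THEN symptom_x = 'yes'", 2.91, 2.50),
    ("IF exposure_b = 'yes' THEN symptom_y = 'yes'", 1.20, 1.10),
    ("IF exposure_c = 'no' THEN symptom_z = 'yes'", 3.40, 3.00),
]
report = top_rules(sample_rows, min_fitness=2.0, limit=100)
print(report)
```

Because the threshold is applied only at report time, the same stored results can be re-filtered with a different cutoff for each recipient without rerunning the discovery session.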

Several reports have been developed in Foxpro for the DaMI system. However, as with 
filtering, reports are tailored to suit the needs of each individual recipient. Summary reports are 
created on an ad-hoc basis; there is a standard detailed report which contains hypotheses and all 
intermediate and final statistical computations. The detailed reports (two main studies were 
conducted in this thesis) of the top 100 hypotheses discovered are contained in Appendix C. 

C. SYSTEM REQUIREMENTS 


1. Hardware and Software Requirements 


From the outset, the author’s goal was to construct a research tool and methodology that 
can be employed by researchers in their community, without the need for a laboratory of (scarce) 
high-power computer assets. In any case, it has already been shown that raw processing power is 
quickly overcome by large unstructured database analysis requirements. Therefore, a genetic 
algorithm is used to intelligently enhance the processing capabilities of whatever platform it runs 
on. In keeping with this goal, DaMI was designed to operate on a standard personal computer 
using inexpensive commercial software. The hardware and software requirements to run 
DaMI are listed below: 


Hardware Requirements 
Personal Computer, 80486/66 MHz processor or better 
8 Megabytes of RAM 
200 Megabytes of free hard disk storage 

Software Requirements 
Microsoft® Visual Foxpro version 3.0 
Microsoft® Windows version 3.xx or Windows 95 

Surpassing the minimum hardware requirements will of course benefit system performance. The 
most dramatic performance improvements will be realized by increasing RAM and the access 
speed of the PC hard drive. 

2. Processing Limits 


DaMI is primarily limited by the time available to the user to complete the analysis; 
however, there are some processing limitations. For the preservation of system speed, DaMI 
maintains the active population in a RAM-based array. Therefore, it is limited by the maximum 
array size allowed in Foxpro. The required array size is a function of population size per 
generation and number of attributes under analysis. The formula for this metric is: 


$$\text{population size} \times \text{analysis fields} < 73{,}500$$


Under this limitation, analysis of 70 fields with a population size of 15,000 (array size 1,050,000) 
would exceed the system limits. Only the number of fields actually under analysis is used in this 
calculation, not the number of fields in the database being analyzed. Also, the number of records 
in the analysis database is limited only by the maximum Foxpro table size (maximum records 
per table file = 1 billion, maximum size of a table file = 2 gigabytes, maximum fields per record 
= 255). Naturally, larger files will take longer for the statistical package to analyze. 
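
The array limit can be checked before a session is started. The short Python sketch below applies the inequality as stated, using the 73,500 figure from the text; the function names are illustrative.

```python
ARRAY_LIMIT = 73_500  # limit quoted in the text for the population array

def fits_in_array(population_size, analysis_fields, limit=ARRAY_LIMIT):
    """True when population_size x analysis_fields stays under the limit."""
    return population_size * analysis_fields < limit

def max_population(analysis_fields, limit=ARRAY_LIMIT):
    """Largest population size that still satisfies the strict inequality."""
    return (limit - 1) // analysis_fields

print(fits_in_array(15_000, 70))  # the 1,050,000-element case from the text
print(max_population(70))
```

Only the fields actually under analysis enter the product, so narrowing the analysis to fewer fields permits a proportionally larger population.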

V. SEARCHING THE HYPOTHESIS SPACE: DaMI 
IMPLEMENTATION 


A. THE GENETIC ALGORITHM 


The basic architecture of the DaMI Genetic Algorithm is based on (Goldberg, 1986), 
with the notable exception that our genetic algorithm stores rules as strings of Boolean attributes 
("true" = consider the attribute; "false" = don't consider the attribute). This allows the genetic 
algorithm to process simple binary strings, as opposed to strings of field values and wildcards 
(Goldberg uses a "*" to denote any value of this attribute is acceptable). This does not imply that 
the genetic algorithm is simplistic; in fact, competition of attributes in aggregate actually provides 
for a more efficient search of the alternative space. As can be seen in Figure #12, a conventional 
genetic algorithm will operate on hypotheses as combinations of attributes and values. In our case, 
this would prevent the genetic algorithm from considering the associations between risk factors 
(exposures/demographics) and outcomes (symptoms/diagnoses) in aggregate. By using the 
DaMI methodology, risk factors and outcome associations (hypotheses) are examined 
comprehensively before competing for selection and genetic recombination. 

Conventional Genetic Algorithm Representation (Goldberg, 1989) 

Figure 12. Conventional and DaMI Algorithm Representations 


This genetic algorithm uses a "roulette wheel" (Goldberg, 1989) model for competitive 
selection with the size of each rule's "slice" (or probability of selection) being directly 
proportional to the fitness measure (determined by the statistical package) of each rule. Slices are 
selected for reproduction, crossover, and mutation randomly, but the "size" of each slice gives a 
proportionally higher chance of survival to rules with higher fitness. As individual rules show 
reproductive dominance, these individuals may possess more than one slice on the roulette 
wheel. (i.e. a particularly strong rule may reproduce more than once per generation, giving it 
more than one slice on the subsequent generation's roulette wheel). We chose the roulette wheel 
(Goldberg, 1989) because it allows the stronger rules to dominate more quickly than with other 
methods (e.g. rank or tournament) and thereby converge faster. The basic genetic operators 
(reproduction, crossover, and mutation) are all implemented in DaMI, with operator adjustable 
profiles (see section V.D). 
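
A minimal Python sketch of fitness-proportional (roulette wheel) selection and of the crossover and mutation operators on Boolean attribute strings is given below. It assumes rules are tuples of booleans (True = consider the attribute), single-point crossover, single-bit mutation, and non-negative fitness values; these details are illustrative choices, not a description of DaMI's actual code.

```python
import random

def roulette_select(population, fitnesses, rng):
    """Pick one rule with probability proportional to its fitness."""
    spin = rng.uniform(0, sum(fitnesses))
    running = 0.0
    for rule, fit in zip(population, fitnesses):
        running += fit
        if spin < running:
            return rule
    return population[-1]  # guards against floating-point edge cases

def crossover(parent_a, parent_b, rng):
    """Single-point crossover: swap the tails of two equal-length rules."""
    point = rng.randrange(1, len(parent_a))
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def mutate(rule, rng):
    """Flip one randomly chosen attribute flag."""
    i = rng.randrange(len(rule))
    return rule[:i] + (not rule[i],) + rule[i + 1:]

rng = random.Random(42)
population = [(True, False, False, True), (False, True, True, False)]
fitnesses = [2.91, 1.20]
parent_1 = roulette_select(population, fitnesses, rng)
parent_2 = roulette_select(population, fitnesses, rng)
child_1, child_2 = crossover(parent_1, parent_2, rng)
print(child_1, mutate(child_2, rng))
```

Because a fit rule can be drawn several times in one generation, its copies occupy several slices of the next generation's wheel, which is the dominance effect described above.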

B. THE STATISTICAL ANALYSIS ALGORITHM 


The DaMI statistical package in use is a fairly simple algorithm. The modular design of 
our system allows for the replacement of this statistical package with a more robust commercial 
package in the future. At this point, the cost of designing an interface outweighs potential 
benefits; this may not be true for more complex analysis projects. 

Given a set of dependent attributes (RHS) and independent attributes (LHS), the 
statistical package creates a two-dimensional array of attributes and possible values. The array 
also contains the number of possible values for each attribute and a counter for each attribute. As 
the statistical algorithm processes each combination, the counter for each attribute is incremented 
accordingly, using a counting base for each attribute equal to that attribute's number of 
possible values (i.e., if the attribute "gender" had two possible values then its counter 
would increment in base 2; if the attribute "state" had fifty values then its counter would 
increment in base 50). The algorithm uses each individual attribute's current counter value to 
reference a cell in the array. The cell values and attribute names are used to create a textual query 
statement. The query statement is then applied to the analysis database and the fitness measure is 
applied to the result. This allows the same statistical algorithm to loop recursively with a 
minimum amount of software code, regardless of the number of attributes passed to it by the 
genetic algorithm. 
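
The counter scheme described above is a mixed-radix odometer. The Python sketch below shows one way to enumerate every attribute/value combination exactly once and turn each into a textual query condition. The attribute names, values, and query syntax are assumptions for illustration, and every attribute is assumed to have at least one possible value.

```python
def value_combinations(attribute_values):
    """Yield each attribute/value combination once, odometer style.

    attribute_values maps attribute name -> list of possible values.
    Each attribute's counter runs in a base equal to its number of values.
    """
    names = list(attribute_values)
    bases = [len(attribute_values[name]) for name in names]
    counters = [0] * len(names)
    while True:
        yield {n: attribute_values[n][c] for n, c in zip(names, counters)}
        # Increment the rightmost counter and carry to the left as needed.
        pos = len(counters) - 1
        while pos >= 0:
            counters[pos] += 1
            if counters[pos] < bases[pos]:
                break
            counters[pos] = 0
            pos -= 1
        if pos < 0:  # every counter rolled over: enumeration is complete
            return

def to_query(combination):
    """Render one combination as a textual filter condition."""
    return " AND ".join(f"{name} = '{value}'" for name, value in combination.items())

# Hypothetical attribute table built during pre-processing.
table = {"gender": ["M", "F"], "uranium": ["yes", "no", "unknown"]}
queries = [to_query(combo) for combo in value_combinations(table)]
print(len(queries), queries[0])
```

The same loop works unchanged for any number of attributes, which matches the requirement that the statistical package handle requests of varying size.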

Several fitness measures have been used (see the discussion in section II. E.4). Our goal, 
since medical researchers seek associations between patient risk factors/exposures, reported 
symptoms, and resulting diagnoses, is to award the bighest fitness values to those LHSs and 
RHSs which are most highly interdependent (vice independent). Since each request from the 
genetic algorithm generates many individual statistical package queries, some means of 
aggregating the fitness measures of all possible combinations is required. Several different 
methods for determining the aggregate fitness measure were considered. Obviously, an average 
of all fitness measures for a given attribute set is non-competitive. In many cases, the highest 
individual fitness measure has been used because of the specificity of the research question. In 
other cases, an aggregate measure may be taken using Chi-square or an average of the top three 
or four j-measures (use of an aggregate value limits the awarding of a high fitness measure based 
on a single unexpected outlier in the research database). 
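
As a minimal sketch of the aggregation choices just described, the function below averages the `top_n` highest per-combination scores: `top_n=1` gives the single best measure, 3 or 4 gives the top-three or top-four average, and the full count gives the plain average. The function name and the sample scores are hypothetical, and the Chi-square alternative is not shown.

```python
from statistics import mean
from typing import Sequence


def aggregate_fitness(scores: Sequence[float], top_n: int = 1) -> float:
    """Average the top_n highest per-combination fitness values."""
    if not scores:
        return 0.0
    best = sorted(scores, reverse=True)[:max(1, top_n)]
    return mean(best)


# Hypothetical per-combination fitness values for one attribute set.
scores = [0.1, 0.2, 9.0, 0.3, 4.0, 5.0]
single_best = aggregate_fitness(scores, top_n=1)         # highest individual measure
top_three = aggregate_fitness(scores, top_n=3)           # average of the top three
overall = aggregate_fitness(scores, top_n=len(scores))   # plain average of everything
```

In this toy case the plain average is pulled down by the many weak combinations, while the top-three average is less sensitive to a single outlier than the single best value.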

A rule cacher (like a disk cacher, except for hypotheses) is used to prevent duplicate 
evaluation of any rule throughout the discovery session. A table of rules evaluated by the 
statistical package and resulting fitness values is maintained. Before sending a rule to the 
statistical package, the genetic algorithm checks the table of rules already evaluated. If the rule 
has been previously evaluated, the genetic algorithm uses the fitness value from the cache table. 
If not, the genetic algorithm package sends the rule to the statistical package and establishes a 
new entry (with resulting fitness) in the cache table. 
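
The rule cacher amounts to a lookup table placed in front of the statistical package. A minimal sketch follows, assuming a rule can be represented as a pair of attribute sets; `RuleCache`, `make_rule`, and the toy scoring function are hypothetical stand-ins, not DaMI's actual interfaces.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str]]  # (LHS attributes, RHS attributes)


class RuleCache:
    """Remember the fitness of every rule already sent to the statistical package."""

    def __init__(self, evaluate: Callable[[Rule], float]) -> None:
        self._evaluate = evaluate          # the expensive statistical evaluation
        self._table: Dict[Rule, float] = {}
        self.evaluations = 0               # how many rules were actually evaluated

    def fitness(self, rule: Rule) -> float:
        if rule not in self._table:        # first sighting: evaluate and record
            self._table[rule] = self._evaluate(rule)
            self.evaluations += 1
        return self._table[rule]           # otherwise reuse the cached value


def make_rule(lhs: Iterable[str], rhs: Iterable[str]) -> Rule:
    """Frozen sets make attribute order irrelevant when comparing rules."""
    return (frozenset(lhs), frozenset(rhs))


def toy_statistical_package(rule: Rule) -> float:
    """Placeholder scoring so the example runs; not a real fitness measure."""
    lhs, rhs = rule
    return float(len(lhs) + len(rhs))


cache = RuleCache(toy_statistical_package)
first = cache.fitness(make_rule(["gender", "service"], ["ptsd"]))
second = cache.fitness(make_rule(["service", "gender"], ["ptsd"]))  # cache hit
```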


C. TUNABLE PARAMETERS 


The program has several tunable parameters to adjust genetic algorithm operation. They are 
set via the user interface at the commencement of each discovery session. 


• Crossover probability. probability that a selected rule will exchange information with 
another selected rule 

• Mutation probability. probability that a selected rule will undergo a random mutation 

prob(reproduction) = 100% - (prob(crossover) + prob(mutation)) 

• Population size. number of individual rules in each generation 

• Number of generations. number of generations to simulate 

• Maximum rule complexity. maximum number of dependent and independent attributes 
allowed in each hybrid rule (set individually for dependent and independent) 

• Average complexity of initial rule set. average number of dependent and independent 
attributes allowed in each rule of randomly generated initial population 

• Top rules to aggregate. number of rules (in order of decreasing fitness) to use in 
computing aggregate fitness by the statistical package 
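
The parameter set can be pictured as a small configuration record. The sketch below is an assumption about structure only (the class and field names are hypothetical, and the average initial complexity is omitted for brevity); the sample values are the production-run settings reported in section VI.A.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GAParameters:
    crossover_probability: float
    mutation_probability: float
    population_size: int
    generations: int
    max_lhs_attributes: int          # maximum independent attributes per rule
    max_rhs_attributes: int          # maximum dependent attributes per rule
    top_rules_to_aggregate: int = 1

    def __post_init__(self) -> None:
        total = self.crossover_probability + self.mutation_probability
        if not 0.0 <= total <= 1.0:
            raise ValueError("crossover + mutation probability must lie in [0, 1]")

    @property
    def reproduction_probability(self) -> float:
        """prob(reproduction) = 100% - (prob(crossover) + prob(mutation))."""
        return 1.0 - (self.crossover_probability + self.mutation_probability)


production = GAParameters(
    crossover_probability=0.30,
    mutation_probability=0.03,
    population_size=1000,
    generations=130,
    max_lhs_attributes=3,
    max_rhs_attributes=2,
    top_rules_to_aggregate=1,
)
```

With these settings the reproduction probability works out to 67%.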




D. PROBLEMS AND IMPROVEMENTS 


Before this discussion of DaMI implementation is concluded, we would like to discuss 
some of the problems encountered in our implementation and our solutions to these problems. 
We found, as many other researchers have, that genetic algorithms are quite successful at 
adaptively improving the quality of tested rules to suit the provided fitness function. However, 
the greatest challenge has been to ensure that our search model adequately represented the 
research questions (i.e., the genetic algorithm does what it is told to do, but have we 
provided it with accurate instructions?). Our focus on problems with proper tuning of the genetic 
algorithm should in no way diminish the perception that a genetic algorithm is an extremely fast 
and effective search technique. It does work as advertised! 


1. Convergence Issues 


One challenge faced by our research was to ensure that the algorithm would effectively 
(not necessarily physically) test the entire search space. A genetic algorithm will rapidly 
(especially using roulette wheel competition) improve the average fitness measure of rules within 
successive generations, but in many cases, the speed of improvement degraded the algorithm's 
ability to comprehensively examine the search space. 

It should be recalled from genetic search theory (Holland, 1975) that search regret (or 
missed rules of interest) is minimized if attributes of successful rules are tested in exponentially 
more combinations in successive generations, and attributes of unsuccessful rules are tested 
exponentially fewer times. This is implemented in a genetic algorithm by giving successful rules 
a higher chance of selection (and thereby the chance to mix information with other successful 
rules) based on the level of their fitness measure. Naturally, successful rules begin to dominate 
the population (in our case take up more slots on the roulette wheel) and increase the chance that 
their constituent attributes are used for future rules. A problem arises when the fitness measure of 
a mediocre rule is disproportionately larger than the other individuals of its generation. If this 
mediocre rule dominates the population too quickly then its attributes provide the only material 
for future rules. The resulting phenomenon is called premature convergence (Koza, 1988) and 
will prevent comprehensive search of the entire space. 

Several steps were taken to prevent this, but generally speaking, great care must be used 
in selecting a fitness measure. If the slope of fitness in proportion to rule quality is too great, 
premature convergence is likely. The author chose to apply a natural logarithm scale to the 
fitness measure. This gave a strong relative advantage to good rules over weak rules, but slowed 
the domination of good rules (or local maxima) over their slightly weaker peers. The author 
also developed a technique called same-parent crossover randomization. In ordinary crossover, if 
two identical parents are selected, the resulting "offspring" are duplicates of the 
parents. In our crossover operator, if the two parents are the same, a single parent is randomly 
bisected into two offspring. Each offspring receives a portion of the parent's genetic material 
(attributes) and a portion of randomly generated material. This has no effect on the algorithm at 
early stages, but it strongly increases the effective mutation probability as the population becomes 
dominated by a few rules (which causes the crossover operator to lose its ability to effectively 
generate new hypotheses; see Figure #13). 


[Figure 13 plots cumulative fitness, crossover %, and mutation %. As the population prematurely 
converges, crossover effectiveness decreases and same-parent crossover increases mutations.] 

Figure 13. Effect of Same-parent Crossover Randomization 
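
Same-parent crossover randomization can be sketched as follows. This is an illustration under stated assumptions rather than DaMI's implementation: a rule is modeled as a tuple of attribute names, the ordinary operator is assumed to be a one-point tail swap, and all function names and sample attributes are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Rule = Tuple[str, ...]  # a rule is modeled here as a tuple of attribute names


def one_point_crossover(a: Rule, b: Rule, rng: random.Random) -> Tuple[Rule, Rule]:
    """Ordinary crossover: swap tails at a random cut (assumed operator)."""
    cut = rng.randint(0, min(len(a), len(b)))
    # dict.fromkeys drops repeated attributes while keeping their order.
    child_1 = tuple(dict.fromkeys(a[:cut] + b[cut:]))
    child_2 = tuple(dict.fromkeys(b[:cut] + a[cut:]))
    return child_1, child_2


def same_parent_crossover(parent: Rule, pool: Sequence[str],
                          rng: random.Random) -> Tuple[Rule, Rule]:
    """Bisect one parent; pad each half with random attributes not in the parent."""
    genes = list(parent)
    rng.shuffle(genes)
    cut = rng.randint(1, len(genes) - 1) if len(genes) > 1 else rng.randint(0, 1)
    outside = [attribute for attribute in pool if attribute not in parent]
    children: List[Rule] = []
    for kept, lost in ((genes[:cut], genes[cut:]), (genes[cut:], genes[:cut])):
        fresh = rng.sample(outside, min(len(lost), len(outside)))
        children.append(tuple(kept + fresh))
    return children[0], children[1]


def crossover(a: Rule, b: Rule, pool: Sequence[str],
              rng: random.Random) -> Tuple[Rule, Rule]:
    if set(a) == set(b):  # identical parents would only reproduce themselves
        return same_parent_crossover(a, pool, rng)
    return one_point_crossover(a, b, rng)


rng = random.Random(0)
pool = ["gender", "service", "race", "smoke", "pesticide", "nerve_gas", "paint", "fuel"]
parent = ("gender", "smoke", "paint", "fuel")
child_1, child_2 = crossover(parent, parent, pool, rng)
```

Early in a run identical parents are rarely paired, so the randomizing branch is seldom taken; as a few rules come to dominate the population it fires more often, which is how the operator raises the effective mutation rate.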
Finally, it was noted that since a genetic algorithm is based on probabilistic selection, 
some extremely strong rules failed to survive (by sheer chance) despite their selective 
advantage. This is an understandable consequence of natural selection; sometimes more capable 
species die solely because of “bad luck.” The author reserved several spaces on the roulette 
wheel for the rules with the highest fitness measure in the population, regardless of their 
selection by the algorithm. This ensures that an extremely "good" rule will continue to be 
available for selection and recombination in successive generations. 
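
One way to picture the reserved roulette-wheel slots, together with the logarithmic scaling mentioned earlier, is sketched below. The exact scaling formula and slot mechanics are not given in the text, so `log(1 + f)` and the pool-building scheme are assumptions, as are all names and the sample fitness values.

```python
import math
import random
from typing import Dict, List


def selection_weight(raw_fitness: float) -> float:
    """Natural-log scaling; log(1 + f) is an assumed form that keeps weights >= 0."""
    return math.log1p(max(raw_fitness, 0.0))


def build_mating_pool(fitness: Dict[str, float], pool_size: int,
                      reserved_slots: int, rng: random.Random) -> List[str]:
    """Reserve slots for the fittest rules, then spin the wheel for the rest."""
    if not fitness:
        return []
    ranked = sorted(fitness, key=fitness.get, reverse=True)
    elite = ranked[:min(reserved_slots, pool_size)]       # guaranteed survivors
    weights = [selection_weight(fitness[rule]) for rule in ranked]
    if sum(weights) == 0.0:                               # degenerate case: uniform wheel
        weights = [1.0] * len(ranked)
    spins = rng.choices(ranked, weights=weights, k=pool_size - len(elite))
    return elite + spins


# Hypothetical fitness values for four rules.
fitness = {"rule_a": 9.5, "rule_b": 4.0, "rule_c": 1.2, "rule_d": 0.1}
mating_pool = build_mating_pool(fitness, pool_size=10, reserved_slots=2,
                                rng=random.Random(1))
```

The log scale compresses the gap between strong and slightly weaker rules, while the reserved slots protect the best rules from being lost to an unlucky sequence of spins.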


2. Processing Speed Issues 


However sophisticated the search technique may be, we must still keep the magnitude of 
this search problem in mind. One of our research goals was to ensure that the technology created 
did not require sophisticated, expensive, or proprietary hardware or software. For this reason the 
DaMI application was developed to run on an 80486/66 MHz personal computer using the 
Microsoft Windows 3.xx or Windows 95 operating system. (Pentium 166s are used for 
production runs.) A very simple problem such as analyzing relations between 15 standard 
symptoms and 21 diagnoses (Boolean fields) yields a search space of 69 billion combinations. A 
486 computer, using the "brute force" method, can test about 600,000 hypotheses (rules) per day. 
At that rate, this problem would take more than 315 years to complete. Even if the speed of 
processing could be accelerated by a factor of 100, the problem would still be impractically large. 
We have processed runs involving exposures/demographics and diagnoses that were on the order 
of 9.457 × 10^16 combinations. Actual processing benchmarks are included later in the paper, but 
the point for the moment is that results using genetic algorithms take days, not minutes, to achieve. 
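
The brute-force estimate for the simple 36-field problem can be checked with a few lines of arithmetic. The function name is hypothetical; the 600,000 tests per day is the rate quoted above. The exact count of 2^36 gives roughly 314 years, and the rounded figure of 69 billion gives about 315.

```python
def brute_force_years(combinations: int, tests_per_day: int = 600_000) -> float:
    """Years needed to test every combination at a fixed daily rate."""
    return combinations / tests_per_day / 365


small_problem = 2 ** (15 + 21)  # 15 Boolean symptoms and 21 Boolean diagnoses
years = brute_force_years(small_problem)
years_if_100x_faster = brute_force_years(small_problem, tests_per_day=60_000_000)
```

Even the hundredfold speedup leaves this small problem at a little over three years of continuous processing.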

Naturally the author took several steps to enhance speed on the given PC architecture. 
First, the population of rules is maintained in a RAM-based array space as is the statistical 
package's attribute and possible value matrix. This allows the genetic operations to be carried out 
with extreme speed. Task complexity is not really a speed issue at all for the genetic algorithm 
package; unfortunately, the database under analysis cannot be placed in RAM, so the statistical 
package becomes the speed limiting operation. Genetic operations take several seconds per 
population, but the statistical package may take hours to analyze a single, large population. In the 
case of the statistical package, the number of attributes and possible values is much more significant 
than the number of records in the analysis database. If the operating architecture could be 
enhanced to allow the genetic algorithm to pass statistical requests to multiple personal computer 
nodes, a significant processing advantage could be attained. 

The nature of our research question concerning a possible syndrome affecting Persian 
Gulf War participants limits the complexity requirement of rules generated. In other words, rules 
involving too many attributes may be statistically significant, but are so specific that they may 
only describe a single participant. Naturally, these rules may have a selective advantage over less 
specific rules, because a single outlier reporting a highly unusual combination of attributes will 
be very highly rated. However, rules involving a single individual do not suggest a syndrome, 
which by definition is a series of conditions affecting a group of individuals. Therefore, we 
included a tunable parameter which limits the maximum complexity of rules generated. Rules 
involving too many attributes are given a low fitness value and are not sent to the statistical 
analysis package. It should be obvious that increasing the number of attributes in a single rule 
exponentially increases the complexity of the analysis by the search package. 


3. Tuning the Fitness Measure, Verification, and Validation 


One of the greatest challenges faced is to develop a fitness measure that accurately reflects the 
requirements of CCEP medical researchers. It is critical that feedback is obtained at every step of 
the discovery process. 

EXAMPLE: Just because there is a high association between hair loss and chronic 
fatigue syndrome within the database under examination does not mean that this is of 
any medical significance. 


It must also be understood that our technique has drastically reduced the number of 
correlations to be investigated by medical researchers, but it does not guarantee that each rule is 
of value. That knowledge can only be obtained from medical professionals. Our goal is to 
provide a catalyst for their research and a "jumping off point" for more in-depth clinical 
investigation. If that mindset is maintained, the genetic algorithm is proving most helpful. 

Verification is also a key issue. Rules and their associated fitness measures generated by 
a genetic algorithm are accurate for the analysis database; that has been easily verified by 
conventional query. Ensuring that the rules generated are the best ones to describe the analysis 
database is more challenging. We have two different methods for responding to this challenge: 
duplicability and reproducibility. 

The database of 19,000 records has been split into several sample sets. Each sample set is 
selected randomly without replacement. We actually use two database subsets of around 7,700 
records each. The genetic algorithm is applied to one sample subset and its output rules are then 
applied to the second subset. If the fitness measure for a rule is uniform throughout the two 
independent, randomly-selected databases, then there is confidence that this rule holds for the 
entire database and is not a statistical anomaly. We call this attribute duplicability. 
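
Drawing the two independent subsets can be sketched as a single sampling without replacement that is then cut into pieces. The function name and the use of integer record identifiers are assumptions; the sizes mirror the figures quoted above.

```python
import random
from typing import List, Sequence


def draw_disjoint_samples(record_ids: Sequence[int], sample_size: int,
                          samples: int, rng: random.Random) -> List[List[int]]:
    """Draw several non-overlapping random samples of equal size."""
    needed = sample_size * samples
    if needed > len(record_ids):
        raise ValueError("not enough records for disjoint samples")
    drawn = rng.sample(list(record_ids), needed)  # sampling without replacement
    return [drawn[i * sample_size:(i + 1) * sample_size] for i in range(samples)]


# 19,000 records, two subsets of about 7,700 records each.
discovery, verification = draw_disjoint_samples(range(19_000), 7_700, 2,
                                                random.Random(42))
```

The genetic algorithm would be run on the first subset and its output rules re-scored on the second.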

The second verification procedure is reproducibility. It cannot be proven that a genetic 
algorithm has actually found the best rules for a given search space. The only way to accomplish 
this is to actually check every possible combination, which we have already stated is physically 
impractical. How then may we have any certainty that the technique has worked; that the 
algorithm has used a sufficiently large population over a sufficiently large number of generations 
to achieve an acceptable answer? Since a genetic algorithm depends on the simulation of survival 
of the fittest (Darwinism) based solely on probability modeling and random number generation, 
it will never analyze the same problem the same way twice. We run every problem twice and 
note the number of rules that occur in both outcome rule sets. If both independent discovery 
sessions produce a high number of the rule intersections, then this indicates that the state space 
has been searched effectively (see Figures #14 and #15). If this is not the case, then the 
population size and/or number of generations must be increased for an effective discovery 
session. 
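
The reproducibility check can be sketched as a set intersection over the rules each run produced within a fitness band. How DaMI computes its intersection percentage is not specified in the text, so the ratio used here (rules common to all runs divided by rules found by any run) is an assumption, as are the function names and the toy run data.

```python
from typing import Dict, Sequence, Set


def rules_in_band(run: Dict[str, float], low: float, high: float) -> Set[str]:
    """Rules from one run whose fitness lies in the half-open band [low, high)."""
    return {rule for rule, fit in run.items() if low <= fit < high}


def intersection_rate(runs: Sequence[Dict[str, float]], low: float, high: float) -> float:
    """Share of the band's rules that every run discovered."""
    bands = [rules_in_band(run, low, high) for run in runs]
    found_by_any = set().union(*bands)
    if not found_by_any:
        return 0.0
    found_by_all = set.intersection(*bands)
    return len(found_by_all) / len(found_by_any)


# Hypothetical output of three independent discovery sessions.
run_a = {"r1": 9.0, "r2": 8.5, "r3": 2.0, "r4": 1.5}
run_b = {"r1": 9.1, "r2": 8.4, "r5": 2.2, "r6": 1.1}
run_c = {"r1": 8.9, "r2": 8.6, "r3": 1.9, "r7": 1.3}
runs = [run_a, run_b, run_c]

high_fitness_agreement = intersection_rate(runs, 8.0, float("inf"))
low_fitness_agreement = intersection_rate(runs, 1.0, 3.0)
```

Strong reproducibility corresponds to high agreement in the upper band together with low agreement in the lower band, which indicates that the runs reached the same answer by different paths.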

Figure 14. Strong Reproducibility in GA Search 

Figure 15. Weak Reproducibility of GA Search 


Finally, a great deal of emphasis is placed on the discovery of rules which are intuitively 
obvious to medical professionals. This may appear insignificant at first, but as mentioned before 
genetic algorithms are unguided random processes possessing no knowledge of medical facts. If, 
through their learning process, they produce a series of rules that mimic accepted medical 
knowledge then this lends confidence that accompanying rules, which do not make intuitive 
sense, may contain new and significant information. 

VI. RESULTS 


A. SUMMARY 


DaMI has achieved striking successes throughout our experiments. The theoretical basis 
for the design of this search algorithm is sound and has allowed this system to perform and 
produce results. DaMI is a very exciting application because its performance matches or exceeds 
theoretical expectations, and it identifies previously undiscovered correlations in the CCEP 
Desert Storm Database. In this chapter, we will characterize the initial success of DaMI by 
presenting a series of experimental results which build on the framework developed by this 
thesis. Success in this research is measured by responding to the following questions: 


• Did the Genetic Algorithm (DaMI) perform as theoretically predicted? 

• What correlations did the Genetic Algorithm actually find in the CCEP database, and 
were these hypotheses, at least from a statistical perspective, consistent with the 
research goals? 

• How useful were the hypotheses discovered to CCEP medical researchers? 


Each will be examined individually in the following sections of this chapter, building up to a 
comprehensive evaluation of DaMI's theoretical as well as practical performance. 

Twenty-five discovery sessions (runs) have been conducted by DaMI thus far, of which 
six production runs are discussed in the results section. Earlier runs were used to test the 
performance of DaMI during development and refine the settings of tunable parameters for 
optimal discovery. Genetic algorithm development is a constant process of discovery, feedback 
and refinement. The runs conducted to date are by no means all-inclusive, but rather chronicle a 
successful venture into the CCEP database. 

DaMI has been directed to analyze two different perspectives of the CCEP database 
(three identical production runs for each perspective). The first runs search for associations 
between the gender, service, race, and reported exposures of PGW participants (LHS) and the 
diagnoses that were assigned by the CCEP medical examination process (RHS). We refer to 
these runs as exposure-to-diagnosis runs. The second set of runs searches for associations between 
gender, service, race, and reported exposures of PGW participants (LHS) and the standard 
symptoms that were elicited during the CCEP medical examinations (RHS). We refer to these 
runs as exposure-to-symptom runs. The reader is referred to Appendix A for a detailed list of 
fields included in each analysis. Each production run utilized a population size of 1000, a 
crossover probability of 30%, and a mutation probability of 3.0% (see section V.C for a discussion of 
tunable parameters). Modified j-measure has been used as a fitness measure, and only the single 
best j-measure of all combinations of each individual attribute set was used for aggregate fitness 
by the statistical analysis package (see section V.B). Hypotheses generated were limited to 
combinations of up to three LHS attributes and two RHS attributes. Production runs have 
simulated at least 130 generations; some were allowed to continue for 170 generations. 


B. DID THE GENETIC ALGORITHM PERFORM AS EXPECTED? 


As theoretically predicted, DaMI performs very well, in terms of speed, hypothesis 
quality improvement, and search space coverage. This question focuses solely on the ability of 
DaMI to perform an efficient, self-improving search and not on the value of results to medical 
professionals (which will be discussed in the next section). The tremendous size of the search 
space has been mentioned earlier, but the number of possible combinations should be presented 
specifically at this point: 


• Exposure-to-diagnosis Runs. 29 Boolean reported exposures, gender (2 possible 
values), service (6 values), race (8 values), and 21 Boolean diagnoses. 

Possible combinations = 2^29 × 2 × 6 × 7 × 2^21 = 9.46 × 10^16 

• Exposure-to-symptom Runs. 29 Boolean reported exposures, gender (2 possible 
values), service (6 values), race (8 values), and 15 Boolean symptoms. 

Possible combinations = 2^29 × 2 × 6 × 7 × 2^15 = 1.48 × 10^15 
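
The two search-space sizes can be reproduced with a short calculation. This sketch assumes the exponents are 29 exposures, 21 diagnoses, and 15 symptoms, and it takes the categorical factors 2, 6, and 7 directly from the printed products (note that the product uses 7 for race, although 8 race values are listed); the function name is hypothetical.

```python
from math import prod
from typing import Sequence


def search_space_size(boolean_fields: int, categorical_sizes: Sequence[int]) -> int:
    """Count value combinations: 2 per Boolean field times each categorical size."""
    return 2 ** boolean_fields * prod(categorical_sizes)


diagnosis_space = search_space_size(29 + 21, [2, 6, 7])  # about 9.46e16
symptom_space = search_space_size(29 + 15, [2, 6, 7])    # about 1.48e15
```

The two spaces differ by a factor of 2^6 = 64, the six additional Boolean fields on the diagnosis side.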


It is clear that these two types of runs present a credible challenge to any genetic algorithm. 
They are both computationally explosive (because of search space size) and highly unstructured 
(because of the high number of LHS and especially RHS attributes), yet DaMI has processed 
them with striking success. 


1. Analysis Speed 


DaMI’s search efficiency allows it to perform analyses, which normally take years, in a 
matter of hours. Analysis speed is the time required for a genetic algorithm to comprehensively 
search the given space. Comprehensive search will be dealt with shortly, but at the moment, we 
will focus on the time required for DaMI to complete an analysis. If that time is significantly 
less than would be possible using a “brute force” examination of the same database, then the first 
advantage has been achieved. As mentioned in section II, it was observed that a personal 
computer can test about 600,000 possible combinations per day. If that is the case, then the 
exposure-to-diagnosis run should take about 432 million years; this is clearly not acceptable. 
Since DaMI never searches a space the same way twice, analysis times for the same problem 
vary; however, DaMI performs the same analysis in 36 hours (on average). Exposure-to-symptom 
runs take about 44 hours, using the genetic algorithm. Although the exposure-to-symptom 
runs involve a smaller search space, DaMI requires more generations to converge on an 
answer. Analysis times do increase in relation to the number of possible combinations; however, 
the character of the research question also affects the time required for DaMI to converge on an 
answer. Analysis times of similar runs are fairly consistent (less than 10% deviation). A profile 
of the three DaMI exposure-to-diagnosis runs is illustrated in Figure #16. 

Figure 16. Analysis Speed Profile of Exposure-to-diagnosis Runs 


Notice that the processing speed increases as a small group of rules begins to dominate the 
population (convergence). It must be reiterated that DaMI uses the same platform as was used 
for “brute force” testing; it is the selectivity of search (knowing what alternatives need not be 
tested) that gives this methodology its incredible advantage. 


2. Hypothesis Quality Improvement 


DaMI is consistently able to adaptively improve the quality of the hypotheses it 
generates as the analysis progresses. A genetic algorithm is theoretically an intelligent, adaptive 
search technique. This means that as processing time passes, the system will generate 
hypotheses of increasing quality based on the results of analyses already conducted. In the case 
of DaMI, quality is indicated by the fitness measure of a hypothesis. The cumulative 
fitness of a generation represents the aggregate quality of all the hypotheses synthesized during 
that generation. Although some new individuals in each generation may receive very low fitness 
measures, if the cumulative fitness increases in successive generations, then the quality of 
hypotheses as a whole is improving. DaMI demonstrates the characteristic ability of genetic 
algorithms to rapidly increase the quality of new hypotheses generated. DaMI rapidly improves 
cumulative fitness until a small group of rules begins to dominate the population (premature 
convergence; Koza, 1989), but (largely because of same-parent crossover randomization) it then 
boosts mutation probability and continues to break through to higher cumulative fitness plateaus. 
A profile of improving hypothesis quality for exposure-to-diagnosis runs is presented in Figure 
#17. Note that in each of the three runs, the cumulative fitness curve levels off (signaling premature 
convergence) and then continues to sporadically increase. 

Figure 17. Analysis Speed Profile of Exposure to Diagnosis Runs 


3. Reproducibility: Search Space Coverage 


While a genetic algorithm may complete a search quickly, the speed advantage is of 
limited value without some indication that the results derived are actually the best in the search 
space. DaMI produces consistent reproducibility on the extremely large spaces it searches, 
attesting to its strong ability to search a large space by testing a small subset of possible 
combinations. As discussed in section V.D.3, proving that a genetic algorithm has completely 
examined a space is a paradoxical question: you cannot prove that the genetic algorithm made 
the right decision without testing every possible hypothesis. Reproducibility gives a strong 
indication that the alternative space has been searched effectively. Ideally, multiple 
independent runs of the genetic algorithm (see section V.D.3) should test only a few 
of the same rules of low fitness but converge on the same rules of high fitness. A low 
intersection of low fitness rules between runs indicates that each approached convergence from 
different areas of the search space (i.e. they did not all follow the same path). A high intersection 
of high fitness rules suggests that, despite entering the search space from different directions, 
each independent run has arrived at the same answer. This reproducibility strongly suggests that 
the entire search space has been effectively, but not physically, examined. 

DaMI achieves high reproducibility in spite of the rapid search time and tremendous 
space. In the exposure-to-diagnosis study, all three runs agree on the same 16 highest fitness 
hypotheses. Lower fitness hypotheses show steadily decreasing levels of intersection, as is 
theoretically predicted. This is particularly exciting, because each production run has achieved 
consensus by testing only 7,100 - 7,400 of the 1,041,000 possible attribute combinations. The 
probability of three independent runs randomly agreeing on the same sixteen hypotheses 
(especially since each run is testing only 0.7 % of all possible attribute combinations) is 
infinitesimally small. The natural question is, “Did the three runs, by some streak of luck, enter 
the search space from the same starting point?” This is not the case, because the three runs only 
tested 14.1% of the same lower fitness rules, proving that they have entered the space from 
different points but converged on the same answer. Note in Figure #18 that the percentage of 
rule intersection (Runs 20, 21, and 22 are the three runs conducted in the exposure-to-diagnosis 
study) between runs approaches 100% for rules with a fitness measure higher than 8.0. This 
intersection decreases steadily as the fitness measure decreases (going left on the graph). In the 
case of exposure-to-symptoms, the reproducibility is not as high, but still quite striking. In this 
study, each run tested between 8,000 and 10,000 hypotheses. The three runs agree on 5 of the 6 
highest fitness hypotheses. This is represented in Figure #19 by an intersection percentage of 
80% on hypotheses with a fitness of over 5.31 (Runs 23, 24, and 25 are the three runs conducted 
in the exposure-to-symptom study). Notice that, as in the exposure-to-diagnosis study, the 
intersection between runs decreases as the fitness measure decreases, culminating with an 
intersection of only 20% for rules with fitness measures between 1.0 and 3.0. 

Figure 18. Exposure-to-diagnosis Reproducibility 

Figure 19. Exposure-to-symptom Reproducibility 

Based on the high reproducibility of DaMI production runs, there is a strong indication 
that the search space has been effectively searched for the given fitness measure and search 
parameters. This is particularly significant in the case of Desert Storm research. Recall that the 
existence of any syndrome has not yet been determined. Therefore, if DaMI fails to find a viable 
syndrome profile but can show that the space has been searched effectively, that information will 
be of extremely high value to CCEP research. Additionally, any comprehensive list of 
correlations between risk factors and medical outcomes will be of value to PGW participants and 
the medical practitioners providing their ongoing medical care. 


C. WHAT DID DaMI FIND? 


DaMI has proven, by the standards of genetic algorithm theory, that it has studied the 
CCEP database quickly, intelligently, and comprehensively. All of the theory and development 
strategies now come down to one question, “What did we learn?” Computational results so far 
suggest that our system has succeeded at the given tasks, requiring relatively few resources. 
Experiments reveal no single syndrome, but numerous correlations do exist that require 
additional clinical analysis. 

Based on DaMI research, there is no indication that a single syndrome or other medical 
entity is causing wide-spread adverse health ramifications among a significant cross-section of 
PGW participants in the CCEP program. By “significant,” we mean that no group of over 100 
participants, sharing a common reported exposure/demographic information, exhibit a unique set 
of reported symptoms and/or outcome diagnoses. Keep in mind that only the 21 most frequently 
reported diagnoses (and combinations of these) have been tested to date. This does not mean that 
a syndrome cannot exist, but the data collected by CCEP and specifically studied by this research 
does not indicate such a correlation. 

There are, however, numerous correlations of exposure/demographic information and 
associated symptoms/diagnoses which suggest that smaller groups may share common health 
conditions based on shared exposure to common health risk factors. These associations are based 
solely on statistical correlation; therefore, a final determination is withheld pending review of the 
information by medical professionals. In any case, the examined data suggests a need for further 
research. 

The number of correlations found by DaMI is quite large; we have resisted summarizing 
hypotheses to preserve the robustness of the information. Therefore, the challenge of filtering 
and reporting awaits the input of CCEP researchers. Each exposure-to-diagnosis run has 
produced around 4,500 hypotheses, and each exposure-to-symptom run has produced about 6,100 
hypotheses. In each case, the three sets of rules are combined into a single hypothesis set (with 
duplicates removed). The information has been further refined, subject to the following criteria: 


• Hypotheses applying to fewer than five individuals in the sample set have been 
removed to prevent undue influence by single outliers. By definition, a syndrome is 
a medical condition shared by a number of individuals. 

• Hypotheses are derived from a randomly selected 45% sample (without replacement) 
subset of the entire CCEP database. These hypotheses are tested against a separate 
45% (independent) partition of the CCEP database. Hypotheses whose fitness 
measure in the second (verification) sample differed from the fitness measure from 
the original sample by more than 20% have been eliminated. Fitness measures 
which remain constant over both the original and verification sample are called 
duplicable, suggesting they hold true for the entire database and are not a statistical 
anomaly. 
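
The two refinement criteria above can be sketched as a simple filter. The record layout, names, and sample hypotheses are hypothetical; the thresholds (at least five individuals, at most a 20% change relative to the original sample's fitness) are the ones stated in the criteria.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    rule: str
    support: int                 # individuals matching the rule in the original sample
    discovery_fitness: float     # fitness in the original sample
    verification_fitness: float  # fitness in the independent verification sample


def is_duplicable(h: Hypothesis, tolerance: float = 0.20) -> bool:
    """True when fitness changes by no more than the tolerance between samples."""
    if h.discovery_fitness == 0:
        return h.verification_fitness == 0
    change = abs(h.verification_fitness - h.discovery_fitness) / abs(h.discovery_fitness)
    return change <= tolerance


def refine(hypotheses: List[Hypothesis], min_support: int = 5,
           tolerance: float = 0.20) -> List[Hypothesis]:
    """Keep hypotheses that cover enough individuals and are duplicable."""
    return [h for h in hypotheses
            if h.support >= min_support and is_duplicable(h, tolerance)]


candidates = [
    Hypothesis("A", 120, 8.0, 7.5),  # kept: well supported, about 6% change
    Hypothesis("B", 3, 9.0, 9.0),    # dropped: fewer than five individuals
    Hypothesis("C", 40, 5.0, 3.5),   # dropped: 30% change between samples
    Hypothesis("D", 5, 4.0, 4.4),    # kept: exactly five individuals, 10% change
]
refined = refine(candidates)
```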


The application of the aforementioned selection criteria has resulted in a set of 2,653 candidate 
hypotheses concerning exposure-to-diagnoses and 4,959 hypotheses concerning exposure-to-symptoms. 
No minimum fitness measure threshold has been applied because the modified j-measure 
is an arbitrary score, suitable for ranking the order of interest of competing hypotheses. 
The fitness measure may not be attached to a specific interest “level.” Obviously, a great number 
of the hypotheses having low fitness measures do not contain correlations strong enough to 
support strong research attention. For this reason and for the sake of brevity, only the 100 
highest fitness hypotheses of each study are included in Appendix C and discussed in the next 
two result summary sections. 

These two sections will discuss the highlights and some specific hypotheses from both 
the exposure-to-diagnosis and exposure-to-symptom studies. The exposure-to-diagnosis and 
exposure-to-symptom results are each exciting for different reasons. The exposure-to-diagnosis 
study contains many high confidence correlations: hypotheses which are applicable to over 50% 
of the participants concerned. The exposure-to-diagnosis hypotheses contain few unexpected 
correlations, but clearly demonstrate the ability of DaMI to cull out extremely strong correlations 
from a “mountain” of data. The exposure-to-symptom results contain many unexpected 
hypotheses, but with somewhat lower correlation strength. The exposure-to-symptom results 
attest to the sensitivity of DaMI analysis and contain new (previously undiscovered) information 
which should attract expanded clinical research. 


1. Exposure-to-diagnosis Correlations 


The exposure-to-diagnosis study yields a large number of strong correlations (positive 
predictive values between exposure and diagnosis of over 50%) and provides corroboration of 
some intuitive aspects of medical relationships. Several new relationships have been identified, 
but few hold information that is unexpected by the non-medical analyst, at least when studied 
separately from associated symptoms. DaMI demonstrates a powerful ability to cull strong 
correlations from a large body of data, and in that respect, the results are very exciting. It must 
be reiterated that only combinations of the 21 most frequently occurring diagnoses have been 
considered at this point. However, a restructuring of the CCEP diagnosis representation which 
groups like diagnoses (with differing ICD codes) may bear even more information. 

No single exposure or group of exposures appears to dominate the resulting hypothesis
set, unlike what will be seen in the exposure-to-symptom study. Several exposures (but no
demographic attributes) appeared in many of the 100 highest fitness hypotheses. Of these
hypotheses, 19% included participants who were wounded and another 19% included participants who
saw casualties. Yet another 19% included participants who reported exposure to
“other paints” and 12% reported exposures to nerve gas. At first, the fact that many hypotheses
include wounded participants appears interesting because only 1% of participants in the CCEP
database have been wounded. Also, only 4% of CCEP participants report exposure to nerve gas,
so that too seems to be highly represented in the hypotheses. Casualties and other paints in
hypotheses are less surprising since both have been highly reported by CCEP participants (50%
and 38% respectively). However, 37% of the hypotheses discovered include Post-traumatic
Stress Disorder and 22% include Depression (CCEP, 1996, p. 19). This high prevalence of
psychosocial diagnoses in the hypothesis set decreases the surprise that many hypotheses
concern wounded participants (as the two are commonly associated). Surprisingly, Severe Sleep
Apnea is included in 20% of the hypotheses. Sleep Apnea is a medical condition not commonly
linked to any CCEP reported exposure. This leaves the prevalence of reported nerve gas
exposures and the diagnosis of Sleep Apnea in hypotheses as the only unexpected attributes,
from a macro perspective. Reported nerve gas exposure is all the more unexpected because
chemical alarms and mustard gas (similar participant concerns) are notably scarce in the
hypotheses. It will be seen later that reported nerve gas exposure plays a significant role in the
exposure-to-symptom study. Finally, it should be noted that oil and smoke, heat and smoke,
Pyridostigmine Bromide (Pb), and headaches are included in few hypotheses--all are factors
receiving high attention in CCEP research.

An explanation of the DaMI reporting format is included in Figure #20. While the space 
is not available to discuss even the 100 highest fitness hypotheses, several illustrative hypotheses 
are presented now in Figure #21. Especially in the exposure-to-diagnosis study, DaMI
demonstrates the ability to unmask a high level of association between exposure/demographic and
diagnosis attributes. DaMI is not limited to finding high positive predictive value (high
probability of the then condition given the if condition); it can also look at the associations in
reverse (high probability of the if condition given the then condition) and examine the
contraindications (if condition precludes the then condition) between exposures/demographics
and diagnoses. An example of each association type is presented below. The medical
professional is referred to Appendix C for a complete list of hypotheses. 
Figure 20. How to Read a DaMI Report

[Figure not recoverable from the scanned source. Legible column labels: Reference Number;
Records Matching “IF”; Records Matching “THEN”; Records Matching (hypothesis); Records Not
Matching (hypothesis); Reverse Confidence; Complex Association Factor; Complex Association
Verification; plus a 2x2 matrix of LHS true/false against RHS true/false.]

(a) IF SERVICE=[?]
    THEN MECHANICAL LOW BACK PAIN="Y" and MAJOR DEPRESSION="Y"
    [Legible values: 17; 1; 1471; 18; 7728; 2.10]

(b) IF PESTICIDES="Y" and NONAF_FOOD="Y"
    THEN DJD/OSTEOARTHRITIS="Y" and SEVERE SLEEP APNEA="Y"
    [Legible values: 3603; 5; 22; 7724; (77.0%); 2.36; 2.33]

Figure 21. Exposure-to-diagnosis Examples
As stated before, the exposure-to-diagnosis examples presented here demonstrate the
capability of DaMI to dig into a “mountain” of data and find strong hypotheses. The examples
presented here are selected to illustrate that capability. It is highly recommended
that the medical professional examine all of the hypotheses (Appendix C) in detail. Figure
#21(a) is a hypothesis of extremely high positive predictive value. The hypothesis states that
94% of participants diagnosed with mechanical lower back pain and major depression served in
the Army. 94% is an extremely high correlation for such a broad hypothesis (a specific diagnosis
combination is linked to a single service). Note that the fitness measure obtained using the
analysis database (complex association factor) is quite close (2.39/2.10) to that of the verification
database (complex association verification), suggesting that the rule holds for all participants (not
a statistical anomaly). The hypothesis illustrated in Figure #21(b) is much more specific, but is
still quite strong. This hypothesis states that 77% of the participants diagnosed with
DJD/Osteoarthritis and Severe Sleep Apnea reported eating Non-allied Forces
food and reported exposure to pesticides. DaMI is capable of isolating strong data correlations,
regardless of hypothesis specificity.
Figure 22. Exposure-to-diagnosis Examples


The next two hypotheses are equally interesting, but are much more difficult to find 
using conventional search techniques. DaMI, using the Modified J-measure, is able to see
correlations which do not fit the high positive predictive value paradigm. The hypothesis in 
Figure #22(a) states that 18% of Marine participants reporting exposure to pesticides and malaria 
have been diagnosed with asthma. A positive predictive value of 18% does not jump out at the 
analyst and would therefore not figure prominently in a conventional analysis; however, DaMI 
notes that only 5.1% of all participants have been diagnosed with Asthma. This means that 
Marines reporting pesticide and malaria exposure are 3.5 times more likely to have been 
diagnosed with Asthma than the general CCEP participant population. In light of that fact, the 
18% positive predictive value of this hypothesis is indeed significant, and DaMI has assigned it a 
high fitness measure. The hypothesis in Figure #22(b) is an example of contraindication. Note 
that this hypothesis shows no high correlation in either direction. The hypothesis states that 2% 
of participants reporting no exposure to Pb and not viewing casualties have been diagnosed with 
Post-traumatic Stress Disorder (PTSD). The reader’s attention is directed to the matrix on the 
right section of the hypothesis report. In 589 cases where the LHS is true, the RHS is false.
Also, in 424 cases where the RHS is true, the LHS is false. In all, 1,022 participants report
information that in some way involves this hypothesis’ exposures or diagnosis. In 99% of those
cases, the exposures exclude the diagnosis outcome. In plain English, not reporting exposure to
Pb or casualties almost always precludes a diagnosis of PTSD. This fact, although readily
apparent to conventional analysis, is very informative because of its exclusive properties and is
therefore flagged by DaMI.
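
The arithmetic behind the two Figure #22 examples can be retraced from the numbers quoted
above. The following is a minimal Python sketch, not DaMI's own code. The function names are
hypothetical, and the count of participants matching both sides of the Figure #22(b) rule is
inferred here as 1,022 - 589 - 424 = 9, which is an assumption about how the 1,022 figure was
tallied.

```python
def lift(ppv, base_rate):
    """How many times more likely the outcome is under the rule than overall."""
    return ppv / base_rate


def exclusion_rate(lhs_only, rhs_only, involved):
    """Share of involved records in which exactly one side of the rule holds."""
    return (lhs_only + rhs_only) / involved


# Figure #22(a): 18% of the rule's participants have asthma versus 5.1% overall.
asthma_lift = lift(0.18, 0.051)

# Figure #22(b): counts quoted in the text.
lhs_only, rhs_only, involved = 589, 424, 1022
both = involved - lhs_only - rhs_only   # assumed: records matching both sides
ppv = both / (both + lhs_only)          # P(THEN | IF)
exclusion = exclusion_rate(lhs_only, rhs_only, involved)

print(f"asthma lift: {asthma_lift:.1f}")
print(f"PTSD rule PPV: {ppv:.1%}, exclusion rate: {exclusion:.1%}")
```

Under that assumption the positive predictive value works out to about 1.5%, which the text
reports as 2%, and the exclusion rate to about 99%.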

The exposure-to-diagnosis study hypotheses exemplify the ability of our genetic 
algorithm to find both strong, obvious correlations and more intricate associations in the CCEP 
database. Many of the hypotheses reinforce “common sense” medical knowledge, but remember 
that DaMI has discovered these hypotheses without the benefit of prior medical knowledge of 
any kind. In light of this success, serious attention should be directed toward those hypotheses 
presented that do not conform to present-day medical perceptions. 


2. Exposure-to-symptom Correlations 


The exposure-to-symptom study is more comprehensive than the diagnosis studies 
because the exposure-to-symptom runs consider every reported symptom category, not a top 
stratification. Many individual hypotheses contain new (or unexpected) correlations and there
are also several interesting trends revealed about the hypotheses as a group. This previously
undiscovered information is of key interest to medical researchers. The author believes that this
is the reason that exposure-to-symptom runs consistently take longer to converge and are 
somewhat less successful at reproducing than exposure-to-diagnosis runs. Even though the 
theoretical search space of exposure-to-symptom runs is smaller, the actual search space contains 
more represented combinations (because all attributes are included) and is therefore practically 
more difficult to solve. This explains the difference in run times for different studies noted 
previously. 

While the exposure-to-diagnosis runs contain several intuitively obvious correlations, the 
exposure-to-symptom runs produce several strong but “unexpected” trends. These unexpected 
trends take the form of pervasive exposure and symptom combinations appearing in many of the 
highest fitness hypotheses, despite the fact that these combinations are not prevalent in the CCEP
database as a whole. These are the specific “threads” of information that DaMI has been 
designed to discover. 

Several exposure attributes appear many times in the highest fitness exposure-to- 
symptom hypotheses: 


• over 50% of the hypotheses include reported exposure to mustard gas (singly or in
combination)

• almost 25% include reported exposure to nerve gas

• 14% include participants that were wounded in combat

• 12% include participants reporting some form of pre-conflict reproductive
difficulties.


This is somewhat unusual because all of these attributes are reported relatively infrequently in the 
CCEP database as a whole. Mustard gas exposure has been reported by 2% of CCEP 
participants, nerve gas 6%, wounded in combat 2%, and pre-conflict reproductive difficulties 
5.5% (CCEP, 1996, p. 19). Finally, the combination of reported nerve gas exposure and pre- 
conflict reproductive difficulties occurs in 9% of the top hypotheses. Notably scarce are 
hypotheses involving actual combat, chemical alarms, scud attacks, race, service, or post-conflict 
reproductive difficulties. It is surprising, since pre- and post-conflict reproductive difficulties
are so highly statistically correlated, that post-conflict reproductive difficulties do not appear in
any of the top hypotheses.
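
The degree to which these four attributes are over-represented can be made concrete by dividing
each attribute's share of the top hypotheses by its share of CCEP participants. The sketch below
is illustrative only: the names are hypothetical, and "over 50%" and "almost 25%" are entered as
0.50 and 0.25, so the resulting ratios are approximations.

```python
# (share of top hypotheses, share of CCEP participants), as quoted in the text.
attributes = {
    "mustard gas": (0.50, 0.02),
    "nerve gas": (0.25, 0.06),
    "wounded in combat": (0.14, 0.02),
    "pre-conflict reproductive difficulties": (0.12, 0.055),
}


def over_representation(hypothesis_share, database_share):
    """Ratio of an attribute's share of hypotheses to its share of participants."""
    return hypothesis_share / database_share


ratios = {
    name: over_representation(h_share, d_share)
    for name, (h_share, d_share) in attributes.items()
}

# List the attributes from most to least over-represented.
for name, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(f"{name}: about {ratio:.1f} times its database prevalence")
```

A ratio near 1 would mean an attribute appears in hypotheses about as often as it is reported;
all four attributes here sit well above that.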

Similarly, the symptoms bleeding gums and weight loss are each included in over 50% 
of the hypotheses, and 44% of the hypotheses involve a combination of both bleeding gums and 
weight loss. Only 127 (or 1.6%) of the participants in the CCEP database subset studied (7746 
total participants) reported that specific combination of symptoms. It is extremely interesting 
that so many hypotheses involve bleeding gums and weight loss, when these two symptoms are 
so scarce in the CCEP database at large. Also noteworthy is the large number of hypotheses 
relating reported mustard gas exposure to bleeding gums and weight loss (44% of hypotheses) 
and nerve gas exposure and pre-conflict reproductive difficulties with bleeding gums (9% of 
hypotheses). Notably scarce are hypotheses including joint pain, headaches,
and fatigue, the symptoms most commonly elicited by physicians (CCEP, 1996, p. 20).

While thesis constraints prohibit discussing all 100 of the highest fitness hypotheses,
several are included to illustrate some of the correlations discovered (Figure #23).
Figure #23. Exposure-to-symptom Examples


The hypothesis in Figure #23(a) is included to demonstrate that DaMI, without the aid 
of medical knowledge, will discover intuitively obvious (to a medical researcher) correlations.
This hypothesis states that 70% of Navy participants who report exposure to diesel fuel and 
mustard gas also complain of difficulty breathing. It is understandable that anyone perceiving an 
exposure to mustard gas and who works with diesel fuel may, at some time, have suffered from 
difficulty breathing. 

In Figure #23(b), it is noted that 21% of participants reporting exposure to nerve gas and 
pre-conflict reproductive difficulties complain of both bleeding gums and muscle pain. Note that 
the fitness measure (2.85) in the analysis database is very close to that of the verification 
database (2.43), indicating that the hypothesis holds across different independent samples of the
entire CCEP database. This hypothesis can be considered unexpected because this specific
exposure combination is reported by only 0.5% of the participants and the symptomatology by
only 3.9%.
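
One simple way to quantify how "close" the analysis and verification fitness measures are is the
fractional drop from one to the other. This is a sketch under that assumption, not a criterion
stated in the thesis; the function name is hypothetical and no acceptance threshold is implied.

```python
def relative_drop(analysis_fitness, verification_fitness):
    """Fractional loss in fitness when a rule is re-scored on held-out data."""
    return (analysis_fitness - verification_fitness) / analysis_fitness


# Fitness pairs (analysis, verification) quoted in the text.
pairs = {
    "Figure #21(a)": (2.39, 2.10),
    "Figure #23(b)": (2.85, 2.43),
}

drops = {label: relative_drop(a, v) for label, (a, v) in pairs.items()}

for label, drop in drops.items():
    print(f"{label}: fitness falls by {drop:.0%} on the verification database")
```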

In Figure #23(c), it is noted that 9% of participants reporting exposure to nerve gas and
mustard gas complain of both bleeding gums and weight loss. As before, the fitness measures
(2.77/2.41) of both the analysis and verification database are quite close. Also note that this 
hypothesis holds in both directions; 6% of participants reporting bleeding gums and weight loss 
reported exposure to nerve gas and mustard gas. This hypothesis is also considered unexpected 
because this specific exposure combination is reported by only 1% of the participants and the 
symptomatology by only 1.6%. 
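
The four percentages quoted for Figure #23(c) can be cross-checked against one another with
Bayes' rule, since the reverse confidence follows from the forward confidence and the two
prevalences. The sketch below uses the rounded figures from the text, so only approximate
agreement should be expected; the function and parameter names are hypothetical.

```python
def reverse_confidence(ppv, p_lhs, p_rhs):
    """P(IF | THEN) from P(THEN | IF), P(IF), and P(THEN) via Bayes' rule."""
    return ppv * p_lhs / p_rhs


# 9% forward confidence, 1% exposure prevalence, 1.6% symptom prevalence.
implied = reverse_confidence(ppv=0.09, p_lhs=0.01, p_rhs=0.016)

print(f"implied reverse confidence: {implied:.1%}")
```

The implied value of roughly 5.6% is consistent with the 6% reverse confidence reported for
this hypothesis.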

In summation, the exposure-to-symptom study brings to light several correlations which 
warrant further clinical analysis. Interest lies not only in the hypotheses themselves, but also in
the high number of correlations involving rare combinations of exposures and symptoms. 


D. ARE THE RESULTS USEFUL TO MEDICAL 
PROFESSIONALS? 


The results of both the Exposure-to-diagnosis and Exposure-to-symptom studies and 
research methodology have been reviewed by Ph.D. Epidemiologists on the CCEP staff and the 
Director of the Deployment Surveillance Team. CCEP Epidemiologists feel that DaMI has great
potential for “identifying previously unrecognized patterns of symptoms and diagnoses” (CCEP,
Sep 1996). They also agree that DaMI has already identified many associations in the CCEP
database that have not been found by conventional methods. However, they strongly emphasize
that DaMI result hypotheses must be subjected to more detailed, epidemiologically based
post-processing before they can be of practical use to the CCEP research effort. They recommend that
future DaMI research efforts be more closely coordinated with CCEP epidemiologists. The 
bottom line is that the substantial potential of DaMI as a research tool has been recognized by the 
medical researchers and the research sponsor has directed that DaMI be included actively in the 
study of Desert Storm Syndrome with the closer involvement of CCEP epidemiologists. 
VII. CONCLUSION


After many months of theoretical development, genetic algorithm design, and fine 
tuning, DaMI has accomplished its goal--to comprehensively search the CCEP Desert Storm 
database and provide medical researchers with a subset of several thousand hypotheses for further 
investigation from the billions of possible combinations. DaMI has proven its ability to search 
an extremely large unstructured database and cull, in a reasonable amount of time, a subset of the 
highest interest rules within that database. DaMI has more to tell us about the CCEP database, as 
it can be retuned for different search priorities and measures of interest. It may also be applied to 
any number of similar bodies of medical and non-medical data. 

This research began with a formidable analysis problem and an idea that the usefulness 
of computer analysis could extend beyond the conventional paradigm of “number crunching.” 
The author believed that by imparting to a genetic algorithm a model of a human researcher's
interest, the genetic algorithm could intelligently attack a tremendous search problem and
reduce it to a manageable size, given limited resources. We have taken a complex research
question and unstructured database and formulated both into a workable representation of
researcher interest and usable source of study. A genetic algorithm (DaMI) has been created
which can perform a self-adapting, intelligent search with striking results. In short, DaMI has 
achieved our vision and exceeded our wildest expectations. This thesis has shown only one 
venture into this new realm of medical research, pre-emptive employment of genetic algorithm 
analysis; there are certainly many more adventures awaiting. 


A. LESSONS LEARNED 


The author encountered few problems during this thesis process. This thesis involves a
very high visibility and politically sensitive subject, Desert Storm Syndrome. As such, there
were numerous requirements for presentations and progress meetings in addition to the normal
research challenges. Since the political obligations were linked to the feedback from the
sponsoring agency, they could not be ignored; this placed a very high time demand on the author.
Also, the sponsoring agency is located in Washington, D.C., so a great deal of travel and remote
communication was required to ensure adequate project coordination. Finally, feedback from
medical researchers in the field was very difficult to obtain because of their diverse geographic
locations and limited availability.


The author has learned several valuable lessons from the thesis process: 


When doing a thesis involving data analysis, do not wait for results to start writing the thesis. 
A great deal of the thesis itself describes the theoretical basis and methodology of the 
research, and therefore, can be written before final results are achieved. The pressure of 
"doing the write-up" is a serious burden to good analysis and writing early helps to alleviate 
that pressure. 

If the thesis is directly funded by an outside agency (in my case the CCEP), it is important to 
clearly identify a liaison at that agency. In my case, there was not a clear procedure for 
information exchange established during the first half of the project, which made 
coordination haphazard. Once a clear coordination mechanism was put in place, the thesis 
process became much smoother. 

It is critical that a researcher have a sounding board who is not directly attached to the
research. It was very easy for me to become so engrossed in the problem that I began
missing glaring solutions. I was lucky to have a single individual (not a genetic algorithm or
medical expert per se) who reality-checked my research and reviewed my thesis throughout
my research. This feedback has proven invaluable to the quality of my thesis and the success
of my research.


B. RECOMMENDATIONS FOR FUTURE RESEARCH 


The success of DaMI opens the door to countless opportunities for future research. Two
areas of study remain to be explored in the CCEP database:

• Analysis of demographic/exposure and a restructured diagnosis set. Efforts are
currently underway to regroup participant diagnosis information so that similar
diagnoses (even those with vastly divergent ICD codes) are grouped together. This
will allow DaMI to analyze a majority of diagnoses, as opposed to the top 21
diagnoses as presented in this thesis.

• Analysis of time/motion study of units and their locations during the Persian Gulf
Conflict. Since in many cases units are homogeneous in location and therefore
exposure to health risks, an analysis of the CCEP participants’ unit location in time
and associated symptoms and/or diagnoses should prove quite fruitful.


It should be obvious that DaMI has not been created with the sole intent of searching for 
a Desert Storm Syndrome. It is applicable to many other large, unstructured databases of 
medical and non-medical data. Aside from examining other bodies of data, there are several 
areas to investigate concerning DaMI itself:


• Comparison of DaMI performance with other commercial data mining software and
other data mining techniques (like regression analysis, cluster analysis, and neural
networks).

• Modification of DaMI’s statistical package to use alternative fitness functions, such
as Chi-square instead of just the Modified J-measure.

• Enhancement of the DaMI genetic algorithm to utilize parallel-processing for
statistical computations. Clearly using a single PC is less efficient than a group of
PC nodes operating simultaneously. This will dramatically increase search speed
without increasing the complexity of computer hardware required.

• Rewriting of the DaMI code into C++ or Ada, so that it can run on a higher capacity
computer platform. Of course, this will increase efficiency, but will make the
algorithm more restrictive (less portable) in terms of operating platforms.
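
To illustrate the alternative fitness function mentioned in the list above, the following Python
sketch scores a rule's 2x2 LHS/RHS table with Pearson's Chi-square statistic. It is not part of
DaMI; the function name is hypothetical. The example counts are those quoted for the Figure
#22(b) hypothesis (589 and 424, with 9 records matching both sides inferred from the 1,022
total), and the database size of 7,746 participants is assumed to apply to that study.

```python
def chi_square_2x2(both, lhs_only, rhs_only, neither):
    """Pearson Chi-square statistic for a rule's 2x2 contingency table."""
    n = both + lhs_only + rhs_only + neither
    lhs_total = both + lhs_only          # records where the IF side holds
    rhs_total = both + rhs_only          # records where the THEN side holds
    denominator = lhs_total * (n - lhs_total) * rhs_total * (n - rhs_total)
    if denominator == 0:
        return 0.0                       # degenerate table: no association measurable
    return n * (both * neither - lhs_only * rhs_only) ** 2 / denominator


total = 7746                             # assumed database size for this study
both, lhs_only, rhs_only = 9, 589, 424   # 'both' inferred as 1022 - 589 - 424
neither = total - both - lhs_only - rhs_only

score = chi_square_2x2(both, lhs_only, rhs_only, neither)
print(f"Chi-square for the rule: {score:.2f}")
```

Unlike a confidence percentage, this statistic grows with any departure from independence, so it
would reward contraindications as well as positive associations; whether that suits a given
search priority is a design choice left open here.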




APPENDIX A. CCEP DATA DICTIONARIES AND DATA 
COLLECTION METHODOLOGY 


DATA DICTIONARY OF CCEP DATABASE 
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[93[RASH_DURA  Number(Dou|S [attribute [number confuses algo [yes/no 
[94/SLEEP_DTE [Date/Time |8 [maybe question info value |LOFR | 
7 95|БІЕЕР БОНА [Number (Dou|8 [attribute [number confuses algo [yes/no 
[—96|WEIGH_DTE 7 |Бае/іте |8 [maybe [question info value |LOFR | 
[  97|WEIGH DURA  — [Number(Dou|8 (attribute [number confuses algo [yes/no 
[—98]OTHR1_COMP [rea | ___по [eamtcorrlate text [delete 
[ 99|OTHRI DTE [Date/Time |8 [no [eantcorrelatetext [delete ” 
[ 100OTHRT DURA [Number(Douj& [no [eant correlate tet delete _ 
[ 10!|OTHR2 COMP — ем №0 [no [cant corelate tet [delete 
m —  - 


7 102|0ТНЕ2 DTE [Date/Time je | 
7 103|ОТНЕ2 DURA  Number(Doujà [no [cant correlate text [delete 

[ 10M4|OTHR3 COMP — Тем [20 |no  |cantcomeltelex delete — 
[ 105|OTHR3 DTE [Date/Time в  |no  cantcomelaeex delete 
[ 106|OTHRS DURA  |Number(Douj8 [no [cant correlate text delete _ 
[ A07[OTHR4 COMP (Tea  |20 [no [camtcorrelatetext delete | 


Q. 


[ 108|OTHR4 DTE — Date/rime [8  |no [cantcorelatetext [delete | 
| 109|OTHR4 DURA |Number(Dou[B ^ |no  |cantcorelatetext — |delte — 
| MO|PRLDAG ^  j|TeX O шш  — |n hea delete - 
( WIPRIICD 4 6 ан | 
| 112[SEC_DIAGT [тт ш шю = delete | 
ОЗ БЕСС [тет 6 [Ras ә | | 
| 1145 С 62 Теж ___|40 по __ (ex delete | 
|_115|5ЕС 1сб2 ______|Тем ____|-  — |RHS jas | — 
7 116|БЕС DiaAG3 — |Te« шошо по jet delete | 
| 17|sEC ICD) —— |tex B [RHS TC 
118| БЕС Срба ____"|Тем ___(40 _ по Дей [delete | 
ОПО БЕСС |7 6 J|RHS (Шаю ||0- 
120| БЕС DIAGS —  |Tex 40 [no Чем 
_121|5ЕС 1605 ______|Теж ____|- ____|ЕН5 _____| Балк EM 
| i22|SEC DIAGO [Text — 40 jo  |e [delete _ 
[ dP3|SEC-IODO — . теме 16  - |RHS Бае е И 
| i24|ALLER CONS [Text |! [no 7 |шиезіюпіліо уаше delete | 
| 125|AUDIO CONS Te« — || |o [questioninfo value [delete 
| i28|CARD| CONS |TeX — |i [no [question info value (еме _ 
127|DENTLCONS |Тех |i [no [questioninfovalue [delete 
| 128|DERMA CONS Мм || [no [question info value [delete — 
| 129EARNT CONS Мм fi [no [question info value [delete | 
130[ENDOC CONS — |Te« |i [no [question info value delete 
| 131[GASTR_CONS [Text [i [no [questioninfovalue [delete | 
[ i32/]HEMAT CONS [Text 1 [no [question info vale [delete | 
133 INFEC_CONS Text no question info value delete 


| 

















134 NEPHR_CONS Text 1 ¡no [question info value delete 
“136 МЕОКО: СОҚ Та mo [question info value — delete — 
з оссие-сон [tes п e question info value [delete 

ORE o NEN к “ЛП 
DATASTRU.XLS


138 | PSYCH_CONS | Text        | 1   | no          | question info value | delete
139 | PTEST_CONS | Text        | [?] | no          | question info value | delete
140 | RHEUM_CONS | Text        | [?] | no          | question info value | delete
141 | [?]        | Text        | [?] | no          | question info value | delete
142 | DIAG_DTE   | Date/Time   | 8   | no          | question info value | [?]
143 | DIAG_DONE  | Text        | [?] | no          | question info value | delete
144 | PTQS_DONE  | Text        | [?] | no          | question info value | delete
145 | PROS_DONE  | Text        | [?] | no          | question info value | [?]
146 | IREL_DONE  | Text        | 1   | no          | question info value | delete
147 | DECL_DONE  | Text        | [?] | no          | question info value | delete
148 | HOME_ADDR1 | Text        | 30  | no          | privacy act         | [?]
149 | HOME_ADDR2 | Text        | 30  | no          | privacy act         | delete
150 | HOME_TOWN  | Text        | [?] | no          | privacy act         | delete
151 | HOME_STATE | Text        | [?] | [?]         | [?]                 | [?]
152 | HOME_ZIP   | Text        | 5   | [?]         | [?]                 | delete
153 | WORK_PHONE | Text        | [?] | no          | privacy act         | delete
154 | [?]        | [?]         | [?] | [?]         | [?]                 | [?]
155 | DCFORM_DTE | Date/Time   | 8   | [?]         | [?]                 | [?]
156 | STARTLATER | Text        | [?] | no          | no info value       | delete
157 | WHENTOCALL | Text        | 5   | no          | no info value       | delete
158 | DECLINE    | Text        | [?] | no          | no info value       | delete
159 | WITHDRAW   | Text        | [?] | no          | no info value       | delete
160 | EVAL_COMP  | Text        | [?] | [?]         | no info value       | delete
161 | [?]        | Text        | [?] | attribute ? | question info value | [?]
162 | [?]        | [?]         | [?] | [?]         | [?]                 | [?]
163 | PO_EVALDTE | Date/Time   | [?] | no          | no info value       | delete
164 | MIL_ADDR1  | Text        | 30  | no          | no info value       | delete
165 | MIL_ADDR2  | Text        | 30  | no          | no info value       | delete
166 | MIL_STATE  | Text        | 3   | no          | no info value       | delete
167 | [?]        | Text        | [?] | no          | no info value       | delete
168 | CHECKL_DTE | Date/Time   | 8   | no          | no info value       | delete
169 | REPORT_DTE | Date/Time   | 8   | no          | no info value       | delete
170 | REPORT_TIM | [?]         | [?] | no          | no info value       | delete
171 | PRIORJAN   | Text        | [?] | no          | no info value       | delete
172 | REFUSED    | Text        | [?] | no          | no info value       | delete
173 | NEGLECTED  | Text        | [?] | no          | no info value       | delete
174 | EDS_VIEWED | Yes/No      | [?] | no          | no info value       | delete
175 | CF_MISSIN  | Text        | 1   | no          | no info value       | delete
176 | [?]        | Text        | 8   | [?]         | [?]                 | [?]
177 | [?]        | Text        | [?] | no          | no info value       | delete
B. DATA COLLECTION METHODS


This section is quoted directly from (CCEP, 1996, pp. 13-14).


Participants may enroll in the CCEP by calling a toll-free number (1-800-796-9699), 
which provides information and referrals to individuals requesting medical evaluations or by 
contacting their local military medical treatment facility (MTF). All MHSS eligible beneficiaries 
are eligible for the CCEP. For eligibility in the CCEP, a PGW veteran (or dependent) must have 
been eligible for DoD health care in June 1994 or later. 

Once an individual is referred, the CCEP provides a two-phase, comprehensive medical 
evaluation, with Phase I being conducted at one of 134 local MTFs. Phase II (when required) is
conducted at one of 14 regional medical centers (RMCs). The medical review includes questions 
about family history, health, occupation, and unique exposures in the Gulf War, as well as a 
structured review of symptoms. 

Once a participant has completed the examination processes, copies of examination 
results are forwarded to the CCEP Program Management Team (PMT), where they undergo 
quality assurance procedures, and the data are entered into the master CCEP database. 

Additionally, of those CCEP participants suffering chronic, debilitating symptoms, the 
DoD has established an SCC at Walter Reed Army Medical Center and will have a second center 
opening in mid 1996 at Wilford Hall Medical Center, Lackland AFB, Texas.

The data, which were initially entered into a relational database, were translated into a 
statistical format for this (CCEP Report on 18,598 Participants) report. Various validity checks 
were conducted to ensure that the data were appropriated for interpretation. Statistical tests and 
descriptive analyses were conducted on various categories of participants, including those in 
theater during the Persian Gulf War, their spouses, and their children. Moreover, the CCEP 
participants who were in theater were compared to the PGW population as a whole and were 
stratified by units to compare those units with higher CCEP participation to those units with 
lower CCEP participation. Specific analyses concerning self-reported exposures, physician- 
elicited symptoms, diagnoses, self-reported reproductive outcomes, self-reported lost workdays, 
physical evaluation boards (PEBs), and program satisfaction were conducted. Additionally, a
comparative analysis with the NAMCS data was conducted using age, sex, race, ethnicity, and
diagnostic code variables to more closely match the CCEP population.


APPENDIX B. DATA DICTIONARY OF SELECTED DaMI FILES
Structure for table:     C:\RESEARCH\VFP\VFPDOCS\DAMISAMP.DBF
Number of data records:  [?]
Date of last update:     [?]
Code Page:               [?]

Field  Field Name   Type       Nulls
    1  RULE         Integer    No
    2  CF           Numeric    No
    3  COMCF        Numeric    No
    4  GENERATN     Integer    No
    5  SERVICE      Character  No
    6  SMOKE_NOW    Character  No
    7  SMOKE_PAST   Character  No
    8  OIL_SMOKE    Character  No
    9  HEAT_SMOKE   Character  No
   10  PASS_SMOKE   Character  No
   11  DIESL_FUEL   Character  No
   12  CARC_PAINT   Character  No
   13  OTHR_PAINT   Character  No
   14  OTHR_SOLVE   Character  No
   15  URANIUM      Character  No
   16  MICROWAVES   Character  No
   17  PESTICIDES   Character  No
   18  NERVE_GAS    Character  No
   19  PYRIDOSTIG   Character  No
   20  MUSTRD_GAS   Character  No
   21  CONTM_FOOD   Character  No
   22  CONTM_WATR   Character  No
   23  NONAF_WATR   Character  No
   24  NONAF_FOOD   Character  No
   25  ANTHRAX      Character  No
   26  BOTULISM     Character  No
   27  MALARIA      Character  No
   28  ACT_COMBAT   Character  No
   29  WOUNDED      Character  No
   30  CASUALTIES   Character  No
   31  SCUD_ATTAC   Character  No
   32  CHEM_ALARM   Character  No
   33  PO_PRIOR     Character  No
   34  PO_AFTER     Character  No

[The Width, Dec, and Index columns are not recoverable from the scanned source. A stray value
"0340" follows the table path, and a further run of 17 Character-typed entries follows without
field names.]
Structure for table:     C:\RESEARCH\VFP\VFPDOCS\RULELIB.DBF
Number of data records:  [?]
Date of last update:     08/04/[?]
Code Page:               1252

Field  Field Name   Type       Nulls
    1  RULE_NUMBE   Numeric    No
    2  NO_TRUE_LH   Numeric    No
    3  NO_TRUE_RH   Numeric    No
    4  NO_TRUE_BO   Numeric    No
    5  NO_FALSE_B   Numeric    No
    6  STANDARD_C   Numeric    No
    7  REVERSE_CF   Numeric    No
    8  COMPLEX_CF   Numeric    No
    9  VCOMPLEX     Numeric    No
   10  LHS_TEXT     Character  [?]

[The scanned source lists three further field types (Character, Character, Integer) whose field
names are not legible. Legible Width values include 150 and 415; the Index column shows one
"Desc" entry. The remaining Width, Dec, and Index values are not recoverable.]



APPENDIX C. TOP 100 HYPOTHESES DISCOVERED BY 
EXPOSURES-TO-DIAGNOSIS AND EXPOSURE-TO-SYMPTOM 
STUDIES 